Term unification plays an important role in many areas of computer science, especially in those 
related to logic. The universal mechanism of grammar-based compression for terms, in particular 
the so-called Singleton Tree Grammars (STGs), has recently drawn considerable attention. Using 
STGs, terms of exponential size and height can be represented in linear space. Furthermore, the 
term representation by directed acyclic graphs (dags) can be efficiently simulated. The present 
paper is the result of an investigation on term unification and matching when the terms given as 
input are represented using different compression mechanisms for terms such as dags and Singleton 
Tree Grammars. We describe a polynomial time algorithm for context matching with dags, when 
the number of different context variables is fixed for the problem. For the same problem, 
NP-completeness is obtained when the terms are represented using the more general formalism of 
Singleton Tree Grammars. For first-order unification and matching, polynomial time algorithms 
are presented, each of them improving previous results for those problems. 

Categories and Subject Descriptors: F.4.1 [Theory of computation]: Mathematical Logic and 
formal languages — lambda calculus and related systems; F.4.2 [Theory of computation]: 
Mathematical Logic and formal languages — Grammars and Other Rewriting Systems 

General Terms: Algorithms 
1. INTRODUCTION 

The task of solving equations is an important component of any mathematically 
founded science. In general, solving an equation s = t consists of finding a substitution 
σ for variables occurring in both expressions s and t such that σ(s) = σ(t) 
holds. The range for the variables, the kind of expressions s and t, and their seman- 
tics, as well as the semantics of = depend on the context. By specifying some of 
these parameters we can define the well-known first-order term unification problem. 
In the context of this problem the expressions s and t are terms with leaf variables 
standing for terms, all function symbols are non-interpreted, and = is interpreted 
as syntactic equality. 

Authors' addresses: LSI Department, Universitat Politècnica de Catalunya, 
Jordi Girona, 1-3, 08034 Barcelona, Spain. 
Inst. f. Informatik, Goethe-Universität, D-60054 Frankfurt, Germany. 

Permission to make digital/hard copy of all or part of this material without fee for personal 
or classroom use provided that the copies are not made or distributed for profit or commercial 
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and 
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, 
to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. 
© 2010 ACM 1529-3785/10/0300-00000 $5.00 

ACM Transactions on Computational Logic, Vol. 0, No. 0, March 2010, Pages 0-0??. 

Unification and Matching on Compressed Terms 

The term matching problem is a particular case of term unification. It is charac- 
terized by the condition that one of the sides of the equation s = t, say t, contains 
no variables. Like term unification, this is a common problem in areas like func- 
tional and logic programming, automated deduction, deductive databases, artificial 
intelligence, information retrieval, compiler design, type checking in programming 
languages, etc. 

The first-order term unification and matching problems are efficiently solvable 
and there is a history of different algorithms: decidable, but exponential [Robin- 
son 1965]; linear time, but using a very special term representation [Paterson and 
Wegman 1978]; and an almost linear one, using a variant of term compression: 
[Martelli and Montanari 1982]. The expressivity of first-order terms is often in- 
sufficient to deal with some of the current challenges in the mentioned areas. 
This motivates the study of some variants and generalizations of the first-order 
term matching and unification problems. In this sense, incorporating more com- 
plex interpretations of the function symbols and equality predicates under equa- 
tional theories has been widely considered (see [Baader and Siekmann 1994; Baader 
and Snyder 2001a]). Further extensions like allowing other kinds of variables related 
to terms have also been explored. This is the case of context variables, i.e. 
variables that can be substituted by contexts, which are trees with a single hole 
(syntactically, the hole is a special constant). For example, consider the term 
t = f(g(a, b), g(a, h(b))). Then the match-equation F(a) = t, where F is a context 
variable, has the solutions F ↦ f(g([•], b), g(a, h(b))) (where [•] means the hole) 
and F ↦ f(g(a, b), g([•], h(b))); the equation f(F(b), F(h(b))) = t has the solution 
F ↦ g(a, [•]), whereas f(F(b), F(b)) = t has no solution. Context matching 
is known to be NP-complete, but there are several subcases that can be solved 
efficiently [Schmidt-Schauß and Stuber 2004]. 
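
The example above can be checked mechanically. The following is a minimal Python sketch, not taken from the paper: it assumes that terms are nested tuples of the form (symbol, child_1, ..., child_m), and the names HOLE and contexts_with_arg are hypothetical. It enumerates, by brute force on uncompressed terms, every context C with C[s] = t.

```python
# Assumption: a term is a tuple (symbol, child_1, ..., child_m); a constant is (symbol,).
HOLE = ("[.]",)  # hypothetical encoding of the hole as a special constant


def contexts_with_arg(t, s):
    """Yield every context C (a term containing HOLE exactly once) with C[s] == t."""
    if t == s:
        yield HOLE
    head, args = t[0], t[1:]
    for i, arg in enumerate(args):
        for c in contexts_with_arg(arg, s):
            yield (head,) + args[:i] + (c,) + args[i + 1:]


a, b = ("a",), ("b",)
g = lambda x, y: ("g", x, y)
h = lambda x: ("h", x)
t = ("f", g(a, b), g(a, h(b)))

# Solutions of F(a) = t.
solutions_Fa = list(contexts_with_arg(t, a))

# Solutions of f(F(b), F(h(b))) = t: F must fit both arguments of t at once.
solutions_two = set(contexts_with_arg(t[1], b)) & set(contexts_with_arg(t[2], h(b)))

# Solutions of f(F(b), F(b)) = t.
solutions_none = set(contexts_with_arg(t[1], b)) & set(contexts_with_arg(t[2], b))
```

The intersection step works here only because F is the sole variable and each of its occurrences sits directly below the root; it is an illustration of the example, not a general matching procedure.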

As illustrated by the example above, the instantiation of a context variable by 
a match is a context, i.e. a tree with a hole. Thus multiple occurrences of the 
same context variable correspond to the question whether there are occurrences of 
the same subtree, but up to one position in the subtree. This has applications in 
computational linguistics [Niehren et al. 1997]. It is also easy to encode questions 
that ask for subtrees that are equal up to several positions. Some applications of 
context matching are in querying XML-data bases: see [Berglund et al. 2007] for 
the XPATH-standard, [Schmidt-Schauß and Stuber 2004] for investigating context 
matching, and [Gottlob et al. 2006] for analyzing conjunctive query mechanisms over 
trees. Another interesting application of context matching is the search within tree 
structures and the corresponding extraction of information. For example, the match 
equation F(s) = t where t is ground and s has no occurrences of F corresponds to 
the question whether there is a subtree of t that is matched by s. This can easily 
be combined as conjunctive search F_1(s_1) = t; ...; F_n(s_n) = t, where the F_i are 
pairwise different and do not occur elsewhere. These match equations correspond 
to the search question whether there are subterms r_i of t that can be matched by 
s_i for i = 1, ..., n, where variables within s_i must have a common instance in t. 

Besides adding expressiveness to these problems it is also necessary to take ac- 
count of the feasibility of implementing the algorithms to which we refer. In that 
sense, an option is to reconsider complexity issues for the original problems or their 
variants by assuming that the input terms are given in some compressed representation. 
Many of the applications dealing with the problems we have introduced 
and their variants require some kind of internal succinct representation for terms, 
in order to guarantee computability in an environment with a limited amount of resources. 
It is well-known that first-order unification may require exponential space 
with a plain term representation whereas only polynomial space is required when 
dags are used for representing terms. Similarly, if terms are large but have lots of 
common subterms, like t_1 = f(a, b), t_2 = f(t_1, t_1), ..., t_n = f(t_{n-1}, t_{n-1}), then the 
context matching equation F(a) = t_n requires exponential space using the plain 
term representation to represent t_n, whereas a dag representation requires linear 
space. This motivates the investigation of context matching with compression techniques 
like dags. Although the context matching problem is NP-complete, sometimes it 
suffices to consider a small number of context variables, which can be thought of 
as fixed for the problem. This kind of restriction has already been considered for 
context unification restricted to two context variables [Schmidt-Schauß and Schulz 
2002], and also proved useful in the context of program verification with proce- 
dure calls [Gulwani and Tiwari 2007; Gascon et al. 2009], where context unification 
for (monadic and multi-ary, respectively) signatures and a single context variable 
allows the automatic generation of invariants. 

Besides the dag representation, more general grammar-based compression mechanisms 
for terms have recently drawn considerable attention in research. In particular, 
a Singleton Tree Grammar (STG) can succinctly represent terms/trees which 
are exponentially big in size and height. Grammar-based compression techniques 
were initially applied to words [Plandowski 1994; Plandowski and Rytter 1999] 
and led to important results in string processing, with applications [Hirao et al. 
2000; Genest and Muscholl 2002; Lasota and Rytter 2006] in software/hardware 
verification, information retrieval, and bioinformatics. Efficient algorithms have 
been developed for checking whether two compressed inputs represent the same 
word/term [Plandowski 1995; Lohrey 2006; Lifshits 2007], also randomized algorithms 
for the equality test [Gasieniec et al. 1996; Berman et al. 2002; Schmidt-Schauß 
and Schnitger 2009], and for finding occurrences of one of them within the 
other (fully compressed pattern matching) [Karpinski et al. 1995; Karpinski et al. 
1996; Miyazaki et al. 1997; Lifshits 2007]. In that sense, Straight-Line Programs 
(SLP), or the equivalent formalism of Singleton Context Free Grammars (SCFG) 
are now a widely accepted formalism for text compression. Essentially, an SCFG, 
i.e. a context free grammar where all non-terminals generate a singleton language, 
is used for representing single words. This notion was extended from words to 
terms [Busatto et al. 2005; Schmidt-Schauß 2005; Comon et al. 1997] such that every 
non-terminal in a Singleton Tree Grammar (STG) represents exactly one tree. 
It led to applications in XML tree structure compression [Busatto et al. 2005] and 
XPATH [Lohrey and Maneth 2005]. STGs have also been proved useful for com- 
plexity analysis of unification algorithms in [Levy et al. 2006b; 2006a]. Recently, it 
was shown that tree grammars using multi-hole-contexts are polynomially equiva- 
lent to STGs [Lohrey et al. 2009]. Moreover, STG-based compressors have already 
been developed [Maneth et al. 2008]. 
Our focus is not on how to compress terms (which can be found e.g. in [Schmidt-Schauß 
2005; Busatto et al. 2008] and for deduction systems in [Cheney 1998; Graf 
1995; 1996]), but on efficient algorithms on compressed terms. This paper is an 
extended and improved version of two conference papers [Gascon et al. 2009; 2008] 
by the same authors. In [Gascon et al. 2008], the context matching problem for a 
fixed number of context variables where the input terms are represented with dags 
was proven decidable in polynomial time. In our view this result shows that com- 
pression is well-behaved for context matching and should be considered. Further- 
more, NP-completeness was shown for the context matching problem with terms 
compressed using STGs. In the present paper we improve the description of both 
algorithms with additional remarks and a more rigorous notation (by representing 
dags as a particular case of STGs). This change in notation allows us to be more 
precise in explanations, proofs, and even in the complexity analysis. These two 
results are presented in Section 3, and Section 5, respectively, where it is shown 
that context matching with dags with k context variables can be solved in time 
0((depth(G))'=+^|G|2log(|G|)), where |G| is the size of the initial dag G (see Theo- 
rem 3.19). As a complement we prove that context matching with STG-compressed 
terms is NP-complete (see Theorem 5.10). Section 4 contains several technical algo- 
rithms and constructions on SCFGs and STGs, which are indispensable for showing 
polynomial space and/or time behavior of the matching and unification algorithms. 
Also in [Gascon et al. 2008], and in [Gascon et al. 2009], there were polynomial 
time algorithms presented for the first-order matching and first-order unification 
problems, in both cases with terms represented with STGs. As a novel contribution 
we describe, in Section 6 and Section 7, faster algorithms for these two problems: 
the first-order unification algorithm on STG-compressed terms runs in time 
O(|V| (|G| + |V| depth(G))^3), where V is the set of variables, and G is the input STG 
(see Theorem 6.2), and the matching algorithm runs in time O((|G| + |V| depth(G))^3) (see 
Theorem 7.3). Moreover, we believe that the presented solutions are also a gain in 
simplicity which makes them easily implementable. 

2. PRELIMINARIES 

A signature is a set ℱ together with a function ar : ℱ → ℕ. Members of ℱ are 
called function symbols, and ar(f) is called the arity of the function symbol f. 
Function symbols of arity 0 are called constants. Let 𝒳 be a set disjoint from ℱ 
whose elements are called variables. We assume the function ar to be also defined 
for variables, i.e. ar : (ℱ ∪ 𝒳) → ℕ, but with ar(V) ∈ {0, 1} for variables V ∈ 𝒳. 
Variables with arity 0, denoted x, y, z with possible indexes, are called first-order 
variables, and variables with arity 1, denoted F with possible subscripts, are called 
context variables. We use f, g, with possible indexes, for denoting an element of ℱ, 
and α for denoting an element in ℱ ∪ 𝒳. 

The set T(ℱ ∪ 𝒳) of terms over ℱ and 𝒳, also denoted T(ℱ, 𝒳), is defined to 
be the smallest set having the property that α(t_1, ..., t_m) ∈ T(ℱ ∪ 𝒳) whenever 
α ∈ (ℱ ∪ 𝒳), m = ar(α) and t_1, ..., t_m ∈ T(ℱ ∪ 𝒳). The set T(ℱ) is called the set 
of ground terms over ℱ, that is, the subset of terms of T(ℱ ∪ 𝒳) with no occurrences 
of variables. We denote by s, t, with possible indexes, terms in T(ℱ ∪ 𝒳). 

The size |t| of a term t is the number of occurrences of variables and function 
symbols in t. The height of a term t, denoted height(t), is 0 if t is a constant or a 
first-order variable, and 1 + max{height(t_1), ..., height(t_m)} if t = α(t_1, ..., t_m). 
Positions of a term t, denoted p, q with possible subindexes, are sequences of natural 
numbers that are used to identify the location of subterms of t. The set Pos(t) of 
positions of t is defined by Pos(t) = {λ} if t is a constant or a variable, and Pos(t) = 
{λ} ∪ {1·p | p ∈ Pos(t_1)} ∪ ... ∪ {m·p | p ∈ Pos(t_m)} if t = α(t_1, ..., t_m), where 
λ denotes the empty sequence and p·q, or simply pq, denotes the concatenation 
of p and q. If t is a term and p a position, then t|_p is the subterm of t at position 
p. More formally defined, t|_λ = t and α(t_1, ..., t_m)|_{i·p} = t_i|_p. We can define a 
partial order ≤ on Pos(t) by p ≤ q if and only if p is a prefix of q, i.e. there is a 
sequence p' such that q = p·p'. We say that positions p and q are disjoint if they 
are incomparable with respect to ≤. We denote by pre(t) the preorder traversal 
(as a word) of a term t. It is recursively defined as pre(t) = t, if t has arity 0, 
and pre(t) = α·pre(t_1)·...·pre(t_m), if t = α(t_1, ..., t_m). Two arbitrary different 
trees may have the same preorder traversal, but when they represent terms over 
a fixed signature where the arity of every function symbol is fixed, the preorder 
traversal is unique for every term. Given a term t, there is a natural bijective 
mapping between the indexes {1, ..., |pre(t)|} of pre(t) and the positions Pos(t) of 
t, which associates every position p ∈ Pos(t) to the index i ∈ {1, ..., |pre(t)|} 
found at root(t|_p) while traversing the tree in preorder. We can recursively define the 
two mappings pIndex(t, p) → {1, ..., |pre(t)|} and iPos(t, i) → Pos(t) as follows. 
pIndex(t, λ) = 1, pIndex(α(t_1, ..., t_m), i·p) = (1 + |t_1| + ... + |t_{i-1}|) + pIndex(t_i, p), 
iPos(t, 1) = λ, and iPos(α(t_1, ..., t_m), 1 + |t_1| + ... + |t_{i-1}| + k) = i·iPos(t_i, k) for 
1 ≤ k ≤ |t_i|. 
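
The notions of this paragraph translate almost literally into code. The following Python sketch is an illustration under assumptions that are not in the paper: terms are nested tuples (symbol, child_1, ..., child_m), positions are tuples of 1-based child indexes with () playing the role of λ, and all function names are hypothetical.

```python
# Assumption: a term is a tuple (symbol, child_1, ..., child_m); () encodes λ.

def positions(t):
    """Yield Pos(t) in preorder."""
    yield ()
    for i, arg in enumerate(t[1:], start=1):
        for p in positions(arg):
            yield (i,) + p


def subterm(t, p):
    """Return t|_p; t[0] is the root symbol, so child i is t[i]."""
    for i in p:
        t = t[i]
    return t


def size(t):
    return 1 + sum(size(arg) for arg in t[1:])


def height(t):
    return 1 + max(height(arg) for arg in t[1:]) if len(t) > 1 else 0


def pre(t):
    """Preorder traversal of t as a list of symbols."""
    return [t[0]] + [sym for arg in t[1:] for sym in pre(arg)]


def p_index(t, p):
    """pIndex(t, p): 1-based index in pre(t) of the root of t|_p."""
    if not p:
        return 1
    i = p[0]
    return 1 + sum(size(arg) for arg in t[1:i]) + p_index(t[i], p[1:])


def i_pos(t, index):
    """iPos(t, index): the position whose root has the given index in pre(t)."""
    if index == 1:
        return ()
    offset = 1  # 1 + |t_1| + ... + |t_{i-1}| for the child currently inspected
    for i, arg in enumerate(t[1:], start=1):
        n = size(arg)
        if index <= offset + n:
            return (i,) + i_pos(arg, index - offset)
        offset += n
    raise IndexError("index exceeds |pre(t)|")


a, b = ("a",), ("b",)
t = ("f", ("g", a, b), ("g", a, ("h", b)))
```

On the term t = f(g(a, b), g(a, h(b))) from the introduction, p_index and i_pos are mutually inverse on every position, which is the bijection described above.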

Intuitively, contexts are terms with a single occurrence of a hole [•] into which 
terms (or other contexts) may be inserted. We denote contexts by upper case letters 
C, D. The set of contexts over ℱ and 𝒳 is denoted by C(ℱ ∪ 𝒳) whereas the set 
of ground contexts over ℱ is denoted C(ℱ). We can provide a formal definition by 
considering a context to be a term in an extended signature that includes an extra 
constant symbol [•], and where this symbol occurs exactly once in the term. Hence, 
the smallest context contains just the hole and has size 1. If C and D are contexts 
and s is a term, CD and Cs represent the context and the term that are like C 
except that the occurrence of [•] is replaced by D and s, respectively. If D_1 = D_2 D_3 
for contexts D_1, D_2, D_3, then D_2 is called a prefix of D_1, and D_3 is called a suffix of 
D_1. The position of the hole in a context C is called hole path, and denoted hp(C), 
and its length is denoted as |hp(C)|. 
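
As an illustration of these definitions, here is a small Python sketch. It is not from the paper and rests on assumptions: contexts are nested tuples (symbol, child_1, ..., child_m) containing the hypothetical constant HOLE exactly once, and plug and hole_path are names chosen for this example.

```python
# Assumption: a context is a term over the signature extended with the constant HOLE.
HOLE = ("[.]",)


def plug(c, s):
    """Return Cs (s a term) or CD (s a context): the hole of c is replaced by s."""
    if c == HOLE:
        return s
    return (c[0],) + tuple(plug(arg, s) for arg in c[1:])


def hole_path(c):
    """Return hp(C) as a tuple of 1-based child indexes, or None if c has no hole."""
    if c == HOLE:
        return ()
    for i, arg in enumerate(c[1:], start=1):
        p = hole_path(arg)
        if p is not None:
            return (i,) + p
    return None


a, b = ("a",), ("b",)
C = ("g", a, HOLE)   # g(a, [.])
D = ("h", HOLE)      # h([.])
CD = plug(C, D)      # g(a, h([.])); C is a prefix of CD and D is a suffix of CD
```

In this encoding the hole path of a composed context is the concatenation of the hole paths of its factors, so |hp(CD)| = |hp(C)| + |hp(D)|.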

A substitution is a mapping σ : 𝒳 → T(ℱ, 𝒳) ∪ C(ℱ, 𝒳) relating first-order variables 
to terms, and context variables to contexts. Substitutions can also be applied 
to arbitrary terms by homomorphically extending them by σ(f(t_1, ..., t_m)) = 
f(σ(t_1), ..., σ(t_m)) and σ(F(t)) = σ(F)σ(t). 

An instance of the context unification problem is a set of equations Δ = {s_1 = 
t_1, ..., s_n = t_n}, where the s_i, t_i are terms in T(ℱ, 𝒳). The question is to compute 
a substitution σ (the solution), such that σ(s_i) = σ(t_i) for all i. The context 
matching problem is a particular case of context unification where one of the sides 
of each equation in Δ is ground. 

With [i, n] we denote the set {i, i + 1, ..., n} ⊆ ℕ. 
2.1 Compressed term representation 

Definition 2.1. A singleton context-free grammar (SCFG) G is a 3-tuple 
(𝒩, Σ, R), where 𝒩 is a set of non-terminals, Σ is a set of symbols (a signature), 
and R is a set of rules of the form N → α where N ∈ 𝒩 and α ∈ (𝒩 ∪ Σ)*. 
The sets 𝒩 and Σ must be disjoint, each non-terminal X appears as a left-hand 
side of just one rule of R, and there exists a well founded ordering >_G such that 
X → r_1 Y r_2 ∈ R implies X >_G Y for any X, Y ∈ 𝒩. The word generated by a 
non-terminal N of G, denoted by w_{G,N} or w_N when G is clear from the context, is 
the word in Σ* reached from N by successive applications of the rules of G. 

Definition 2.2. A singleton tree grammar (STG) is a 4-tuple G = 
(𝒯𝒩, 𝒞𝒩, Σ, R), where 𝒯𝒩 are tree/term non-terminals, or non-terminals of arity 
0, 𝒞𝒩 are context non-terminals, or non-terminals of arity 1, and Σ is a signature 
of function symbols (the terminals), such that the sets 𝒯𝒩, 𝒞𝒩, and Σ are pairwise 
disjoint. The set of non-terminals 𝒩 is defined as 𝒩 = 𝒯𝒩 ∪ 𝒞𝒩. The rules in R 
may be of the form: 

— A → α(A_1, ..., A_m), where A, A_i ∈ 𝒯𝒩, and α ∈ Σ is an m-ary terminal symbol. 
— A → C_1 A_2, where A, A_2 ∈ 𝒯𝒩, and C_1 ∈ 𝒞𝒩. 
— C → [•], where C ∈ 𝒞𝒩. 
— C → C_1 C_2, where C, C_1, C_2 ∈ 𝒞𝒩. 
— C → α(A_1, ..., A_{i-1}, C_i, A_{i+1}, ..., A_m), where A_1, ..., A_{i-1}, A_{i+1}, ..., A_m ∈ 
𝒯𝒩, C, C_i ∈ 𝒞𝒩, and α ∈ Σ is an m-ary terminal symbol. 
— A → A_1 (λ-rule), where A and A_1 are term non-terminals. 

Let N_1 >_G N_2 for two non-terminals N_1, N_2, iff (N_1 → t) ∈ R and N_2 occurs in t. The 
STG must be non-recursive, i.e. the transitive closure of >_G must be terminating. 
Furthermore, for every non-terminal N of G there is exactly one rule having N as 
left-hand side. Sometimes we refer to the right-hand side of this rule as the definition 
of N in G. Given a term t with occurrences of non-terminals, the derivation of t by 
G is an exhaustive iterated replacement of the non-terminals by the corresponding 
right-hand sides. The result is denoted as w_{G,t}. In the case of a non-terminal N 
we also say that N generates w_{G,N}. We will write w_N when G is clear from the 
context. 

Note that we have used Σ instead of ℱ for denoting the set of terminals of 
the grammar, although it is also a signature. We explain the reasons as follows. 
In this paper, STGs are used for representing terms and contexts. In particular, 
a term non-terminal A of an STG G generates a term. If Σ were ℱ we would be able to 
represent just ground terms. But we want to represent non-ground terms, i.e. 
terms with occurrences of first-order and context variables. Thus, Σ must also 
contain variables, of arity 0 if they are first-order variables, and of arity 1 if they 
are context variables. We will represent a substitution application {V ↦ t} by 
converting the variable V from a terminal into a non-terminal of the grammar and 
adding the necessary rules such that it generates t. Thus, in this setting, variables 
can be represented both by terminals and non-terminals of the grammar. 

Given an STG G = (𝒯𝒩, 𝒞𝒩, Σ, R) we can refer to the set T(𝒯𝒩 ∪ 𝒞𝒩 ∪ Σ) of 
terms over the terminals and non-terminals of G where symbols in 𝒯𝒩 have arity 
0 and symbols in 𝒞𝒩 have arity 1. Similarly, we can refer to the set C(𝒯𝒩 ∪ 𝒞𝒩 ∪ Σ) 
of contexts over the terminals and non-terminals of G. 

With respect to the notation used in this paper, we denote indiscriminately terms 
in T(𝒯𝒩 ∪ 𝒞𝒩 ∪ Σ) and T(ℱ, 𝒳) by s, t, u, v, with possible indexes, since at each 
point of this paper it is clear from the context to which set we refer. By capital 
letters A, B we refer to term non-terminals and by C, D we refer to context non-terminals 
of a given STG. By N we denote a non-terminal of the grammar in 
general. We denote by α the terminals of the grammar in general, by f, g the 
terminals of the grammar which represent a function symbol, by F, with possible 
indexes, both the terminals and non-terminals of the grammar representing context 
variables, and finally, we denote by x, y, z both the terminals and non-terminals of 
the grammar representing first-order variables. 

Now that the set T(𝒯𝒩 ∪ 𝒞𝒩 ∪ Σ) has been introduced, given a term t ∈ 
T(𝒯𝒩 ∪ 𝒞𝒩 ∪ Σ), we can define w_{G,t} more formally. 

Definition 2.3. Let G = (𝒯𝒩, 𝒞𝒩, Σ, R) be an STG. Let t be a term in T(𝒯𝒩 ∪ 
𝒞𝒩 ∪ Σ) or a context in C(𝒯𝒩 ∪ 𝒞𝒩 ∪ Σ). Then, we define w_t recursively as follows. 

— If t = α(t_1, ..., t_m) for some terminal α ∈ Σ of arity m then w_t = α(w_{t_1}, ..., w_{t_m}). 
— If t = N for some non-terminal N of G with a rule N → u ∈ R then w_t = w_u. 
— If t = C(t_1) for some context non-terminal C of G then w_t = w_C w_{t_1}. 
— If t = [•] then w_t = [•]. 

Definition 2.4. Let G = (𝒯𝒩, 𝒞𝒩, Σ, R) be an STG. Let S be a set of non-terminals 
of G. We define restriction(G, S) = (𝒯𝒩', 𝒞𝒩', Σ, R') as the STG 
where 𝒯𝒩' ⊆ 𝒯𝒩, 𝒞𝒩' ⊆ 𝒞𝒩, and R' ⊆ R are the smallest sets such that 
G' = restriction(G, S) satisfies w_{G',N} = w_{G,N} for each non-terminal N in S. 

A directed acyclic graph (dag) can be defined as a particular case of an STG (in 
fact, this representation is in direct correspondence with the classic implementation 
of dags using adjacency lists). 

Definition 2.5. A DAG is an STG where the set of context non-terminals 𝒞𝒩 is 
empty, and moreover, there are only rules of the form A → f(A_1, ..., A_m). 

Example 2.6. The STG {A_0 → f(A_1, A_1), A_1 → f(A_2, A_2), ..., A_{n-1} → 
f(A_n, A_n), A_n → a} is a DAG that represents the complete binary tree of height n 
over a function symbol f and a constant a. The size of this term is exponential, 
whereas its height is linear. 

Nevertheless, STG-represented terms may have exponential height in the size 
of the grammar in contrast to dags, which only allow for a linear height in the 
(notational) size of the dags. 

Example 2.7. The STG {C_0 → C_1 C_1, C_1 → C_2 C_2, C_2 → C_3 C_3, ..., C_{n-1} → 
C_n C_n, C_n → g(C), C → [•]} represents the context w_{C_0} = g^(2^n)([•]), whose height is 
exponential. This is not a DAG. 
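
Examples 2.6 and 2.7 can be reproduced for small n with the following Python sketch. It is an illustration only: the dictionary encoding of rules, the rule tags "sym", "apply" and "hole", and all function names are assumptions of this example, not notation from the paper. The function expand computes w_N by exhaustive replacement, which takes exponential time in general and is therefore only suitable for tiny grammars.

```python
# Assumed encoding of an STG: a dict mapping each non-terminal to one rule, where
#   ("sym", a, [N1, ..., Nm])  stands for  N -> a(N1, ..., Nm)  (term or context rule),
#   ("apply", C1, N2)          stands for  N -> C1 N2,
#   ("hole",)                  stands for  C -> [.].
HOLE = ("[.]",)


def plug(c, s):
    """Replace the hole of the expanded context c by s."""
    if c == HOLE:
        return s
    return (c[0],) + tuple(plug(arg, s) for arg in c[1:])


def expand(G, n):
    """Return w_n as a nested tuple (exponential in general)."""
    rule = G[n]
    if rule[0] == "sym":
        return (rule[1],) + tuple(expand(G, x) for x in rule[2])
    if rule[0] == "apply":
        return plug(expand(G, rule[1]), expand(G, rule[2]))
    return HOLE


def dag_grammar(n):
    """Example 2.6: A_i -> f(A_{i+1}, A_{i+1}) for i < n, and A_n -> a."""
    G = {f"A{i}": ("sym", "f", [f"A{i + 1}", f"A{i + 1}"]) for i in range(n)}
    G[f"A{n}"] = ("sym", "a", [])
    return G


def ctx_grammar(n):
    """Example 2.7: C_i -> C_{i+1} C_{i+1} for i < n, C_n -> g(C), and C -> [.]."""
    G = {f"C{i}": ("apply", f"C{i + 1}", f"C{i + 1}") for i in range(n)}
    G[f"C{n}"] = ("sym", "g", ["C"])
    G["C"] = ("hole",)
    return G


def size(t):
    return 1 + sum(size(arg) for arg in t[1:])


def height(t):
    return 1 + max(height(arg) for arg in t[1:]) if len(t) > 1 else 0


tree = expand(dag_grammar(4), "A0")      # complete binary tree of height 4
context = expand(ctx_grammar(4), "C0")   # g applied 2^4 times to the hole
```

With n = 4 the DAG has five rules but generates a term with 31 symbols, while the context grammar has six rules and generates a context of height 16.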

A DAG G is called optimally compressed if equal terms are represented by the 
same term non-terminal. The test whether a DAG is optimally compressed can 
be performed in time O(n log n), and a transformation into optimally compressed 
form in time O(n log n). 
Definition 2.8. The size |G| of an STG G is the sum of the sizes of its rules, where 
the size of a rule N → u is 1 + |u|. The depth within G of a non-terminal N is 
defined recursively as depth(N) := 1 + max{depth(N') | N' is a non-terminal in u 
where N → u ∈ G} and the maximum of an empty set is assumed to be 0. The 
depth of a grammar G is the maximum of the depths of all non-terminals of G, and 
it is denoted as depth(G). 

If the signature is fixed, then we could also use the number of rules as a complexity 
measure of STGs. 

Plandowski [Plandowski 1994; 1995] proved decidability in polynomial time for 
the word problem for SCFGs, i.e., given an SCFG P and two non-terminals A and 
B, to decide whether w_A = w_B. The best complexity for this problem has been 
obtained recently by Lifshits [Lifshits 2007] with time O(|P|^3). In [Busatto et al. 
2005; Schmidt-Schauß 2005; Busatto et al. 2008] Plandowski's result is generalized 
to STGs. Since the result in [Busatto et al. 2005] is based on a linear reduction 
from terms to words and a direct application of Plandowski's result, it also holds 
for the Lifshits result. Hence, we have the following. 

Theorem 2.9. ([Lifshits 2007; Busatto et al. 2005; 2008]) Given an STG G, 
and two tree non-terminals A, B from G, it is decidable in time O(|G|^3) whether 
w_A = w_B. 

Several properties of STGs are efficiently decidable. The following lemmas will 
be used all along the paper. 

Lemma 2.10. Let G be an STG. The number |w_N|, for every non-terminal N 
of G, is computable in time O(|G|). 

Proof. We give an alternative definition of |w_N| recursively as follows. 

— if (N → f(N_1, ..., N_m)) ∈ G then |w_N| = 1 + |w_{N_1}| + ... + |w_{N_m}|, where 
N_1, ..., N_m are non-terminals of G and f is a function symbol with ar(f) = m. 
— if N → C_1 N_2 then |w_N| = |w_{C_1}| + |w_{N_2}| − 1, where C_1 is a context non-terminal 
and N_2 is a non-terminal of G. 

The correctness of the above definition can be shown by induction on the size of 
w_N. Moreover, since the recursive calls in the definition of |w_N| will be done, at 
most, over all the non-terminals of G, |w_N| is computable in linear time over |G| 
using a dynamic programming scheme. □ 
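
The dynamic programming scheme of this proof can be sketched in Python as follows. This is a minimal illustration under assumptions: the dictionary encoding of rules with the tags "sym", "apply" and "hole" and the function name sizes are hypothetical, a hole rule is given size 1 as stated for the smallest context in Section 2, and λ-rules are left out for brevity. Each non-terminal is evaluated once thanks to the memo table; the recursion depth equals the depth of the grammar, so a very deep grammar would need an explicit stack instead.

```python
# Assumed encoding of an STG: a dict mapping each non-terminal to one rule, where
#   ("sym", f, [N1, ..., Nm])  stands for  N -> f(N1, ..., Nm),
#   ("apply", C1, N2)          stands for  N -> C1 N2,
#   ("hole",)                  stands for  C -> [.].

def sizes(G):
    """Return a dict with |w_N| for every non-terminal N of G, without expanding w_N."""
    memo = {}

    def size(n):
        if n not in memo:
            rule = G[n]
            if rule[0] == "sym":
                memo[n] = 1 + sum(size(x) for x in rule[2])
            elif rule[0] == "apply":
                # The hole of w_C1 is replaced by w_N2, hence the -1.
                memo[n] = size(rule[1]) + size(rule[2]) - 1
            else:
                memo[n] = 1  # the context consisting of the hole only
        return memo[n]

    for n in G:
        size(n)
    return memo


# Example 2.6 with n = 50: A_i -> f(A_{i+1}, A_{i+1}), A_50 -> a.
dag = {f"A{i}": ("sym", "f", [f"A{i + 1}", f"A{i + 1}"]) for i in range(50)}
dag["A50"] = ("sym", "a", [])

# Example 2.7 with n = 50: C_i -> C_{i+1} C_{i+1}, C_50 -> g(C), C -> [.].
ctx = {f"C{i}": ("apply", f"C{i + 1}", f"C{i + 1}") for i in range(50)}
ctx["C50"] = ("sym", "g", ["C"])
ctx["C"] = ("hole",)

dag_sizes = sizes(dag)
ctx_sizes = sizes(ctx)
```

Both grammars have about fifty rules, yet the computed sizes are of the order of 2^50, which would be out of reach by explicit expansion.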

Lemma 2.11. Given an STG G, a terminal α, and a non-terminal N of G, it 
is decidable in time O(|G|) whether α occurs in w_N. 

Proof. Whether α occurs in w_N can be computed efficiently again using a 
dynamic programming scheme: note that α occurs in w_N iff either the rule for N in G 
contains α, or α occurs in w_{N'} for some non-terminal N' occurring in the right-hand side of 
the rule for N. □ 
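
A possible rendering of this occurrence test is sketched below in Python. As before, the dictionary encoding of rules and the function name occurs are assumptions made for this example and do not come from the paper. Every non-terminal is visited at most once because results are memoized.

```python
# Assumed encoding of an STG: a dict mapping each non-terminal to one rule, where
#   ("sym", a, [N1, ..., Nm])  stands for  N -> a(N1, ..., Nm),
#   ("apply", C1, N2)          stands for  N -> C1 N2,
#   ("hole",)                  stands for  C -> [.].

def occurs(G, a, start):
    """Decide whether terminal a occurs in w_start, visiting each rule at most once."""
    memo = {}

    def occ(n):
        if n not in memo:
            rule = G[n]
            if rule[0] == "sym":
                memo[n] = rule[1] == a or any(occ(x) for x in rule[2])
            elif rule[0] == "apply":
                memo[n] = occ(rule[1]) or occ(rule[2])
            else:
                memo[n] = False  # a hole rule contains no terminal
        return memo[n]

    return occ(start)


# A small hypothetical grammar: A0 generates f(g(a), g(a)).
G = {
    "A0": ("sym", "f", ["A1", "A1"]),
    "A1": ("apply", "C0", "A2"),
    "C0": ("sym", "g", ["C"]),
    "C": ("hole",),
    "A2": ("sym", "a", []),
}
```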

3. A PTIME ALGORITHM FOR K-CONTEXT MATCHING WITH DAGS 

The context matching problem is NP-complete [Schmidt-Schauß and Schulz 1998]. 
In this section we reconsider this problem by introducing the additional restriction 
stating that the maximum number k of different context variables of a given instance 
is fixed for the problem. We refer to this problem as k-context matching, which is 
in fact a family of problems indexed by k. Our goal is to prove that a complete 
representation of all solutions is computable in polynomial time when the input 
terms are represented with dags. This variant is called k-context matching with 
dags (k-CMD problem). 

Our algorithm is presented as non-deterministic, but where the guessing can 
choose only from a polynomial number of possibilities. In Subsection 3.1, we solve 
the problem for the simpler case of uncompressed terms. This case is easy, but 
serves for a better understanding of some ideas appearing later, and shows the use 
of the non-determinism for simplifying explanations. In Subsection 3.2, we explain 
a situation where the context solution for a context variable can be inferred. It 
is used several times in the algorithm. In Subsection 3.3, we give the intuition 
behind the algorithm in order to help understanding the technical difficulties. In 
Subsection 3.4 we specify the data representation used in the algorithm, based on 
STGs. We explain the advantages of using STGs for representing dags, such as 
clarity, but also simplicity when analyzing complexity of the required operations 
for this problem. In Subsection 3.5 we present the set of rules of the algorithm, 
prove that they are sound and complete, and that they give in fact a complete 
representation of all the solutions for the initial set of equations. In Subsection 3.6 
we analyze complexity issues. 

3.1 k-Context Matching for Uncompressed Terms 

A non-deterministic polynomial time algorithm with few guesses can be easily 
obtained for the k-context matching problem. Suppose we are given an instance 
{s = t} of the problem, where t is a ground term and s contains at most k different 
context variables. Any solution of {s = t} instantiates every context variable by a 
context occurring in t. The number of different contexts in t is bounded by |t|^2. 
This is because any context occurring in t can be defined by two positions of t: 
the root position and the hole position of the context. Hence, it suffices to do 
at most k guesses of contexts for the context variables among |t|^2 possibilities. 
After applying this partial substitution, we have to check if the resulting first-order 
matching problem has a solution. Since k is assumed to be fixed, the overall 
execution time is polynomial. 
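
The argument above can be turned into a brute-force procedure for uncompressed terms. The Python sketch below is a determinized illustration, not the algorithm of the paper: it assumes terms are nested tuples (symbol, child_1, ..., child_m), that the caller lists the context variables (cvars) and first-order variables (fvars) explicitly, and all function names are hypothetical. It tries every assignment of contexts of t to the context variables and then runs plain first-order matching.

```python
from itertools import product

HOLE = ("[.]",)  # hypothetical encoding of the hole


def positions(t):
    yield ()
    for i, arg in enumerate(t[1:], start=1):
        for p in positions(arg):
            yield (i,) + p


def subterm(t, p):
    for i in p:
        t = t[i]
    return t


def replace(t, p, s):
    """Return t with the subterm at position p replaced by s."""
    if not p:
        return s
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], s),) + t[i + 1:]


def contexts_of(t):
    """Contexts occurring in t: a root position p and a hole position q below it."""
    for p in positions(t):
        sub = subterm(t, p)
        for q in positions(sub):
            yield replace(sub, q, HOLE)


def plug(c, s):
    if c == HOLE:
        return s
    return (c[0],) + tuple(plug(arg, s) for arg in c[1:])


def instantiate(s, sigma):
    """Apply the context-variable assignment sigma to s."""
    if s[0] in sigma:  # s = F(s1) becomes sigma(F)[s1]
        return plug(sigma[s[0]], instantiate(s[1], sigma))
    return (s[0],) + tuple(instantiate(x, sigma) for x in s[1:])


def fo_match(s, t, fvars, theta):
    """First-order matching of s against ground t; returns theta or None."""
    if s[0] in fvars:
        return theta if theta.setdefault(s[0], t) == t else None
    if s[0] != t[0] or len(s) != len(t):
        return None
    for x, y in zip(s[1:], t[1:]):
        if fo_match(x, y, fvars, theta) is None:
            return None
    return theta


def context_match(s, t, cvars, fvars):
    """Yield all solutions (sigma, theta) of s = t; at most |t|^(2k) candidates."""
    candidates = set(contexts_of(t))
    for choice in product(candidates, repeat=len(cvars)):
        sigma = dict(zip(cvars, choice))
        theta = fo_match(instantiate(s, sigma), t, fvars, {})
        if theta is not None:
            yield sigma, theta


a, b = ("a",), ("b",)
t = ("f", ("g", a, b), ("g", a, ("h", b)))
s1 = ("f", ("F", b), ("F", ("h", b)))      # f(F(b), F(h(b)))
s2 = ("f", ("F", b), ("F", b))             # f(F(b), F(b))
s3 = ("F", ("g", ("x",), ("h", b)))        # F(g(x, h(b)))
```

The loop over product(candidates, repeat=len(cvars)) makes the polynomial bound for fixed k visible, and also shows why this approach breaks down once t is given as a dag.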

When the input is compressed with dags, the problem becomes more difficult. In 
particular, the number of different contexts of the right-hand side can be exponential 
in the size of the input. For example, t_1 defined by t_1 = f(t_2, t'_2), t_2 = f(t_3, t'_3), t'_2 = 
f(t'_3, t_3), ..., t_n = f(a, b), t'_n = f(b, a) has 2^(n-1) different contexts with the argument 
a, which precludes an efficient test for all contexts, e.g. in the matching problem 
F(f(f(b, a), f(a, b))) = t_1. 

3.2 Inferring the Joint Context 

One of the key points for obtaining a polynomial time algorithm is the fact that 
in some cases, the (joint) context solution for a context variable can be inferred. 
Consider the simple case where we have two matching equations of the form F(s) = 
u and F(t) = v, and suppose that u and v are different. Suppose also that we know 
the existence of a solution σ for these equations, but the only known information 
for σ is |hp(σ(F))|, i.e. just the length of the hole path of σ(F) and nothing else. 
It can be proved that this information suffices to obtain σ(F). With this aim we 
define below JointCon(u, v, l) for any terms u and v, and natural number l, which 
intuitively corresponds to the supposed |hp(σ(F))|. 

Definition 3.1. Let u, v be terms, let l ∈ ℕ. We define JointCon(u, v, 0) to be 
the empty context [•]. We also define JointCon(f(u_1, ..., u_m), g(v_1, ..., v_m), l + 
1) = f(u_1, ..., u_{i-1}, JointCon(u_i, v_i, l), u_{i+1}, ..., u_m) in the case where f = g 
and there exists i ∈ [1, m] such that u_j = v_j for all j ≠ i. Otherwise, 
JointCon(f(u_1, ..., u_m), g(v_1, ..., v_m), l + 1) is undefined. 

Note that in the second case of the previous definition, if f = g and such an i exists, 
then it is unique. This is because f(u_1, ..., u_m) and g(v_1, ..., v_m) are different, and 
hence, u_j = v_j for all j ≠ i implies that u_i ≠ v_i. 

Example 3.2. Let u, v, w be f(a, g(h(a, a), c), b), f(a, g(h(b, b), c), b) and 
g{J{aJ),c).b), respectively. Then, JointCon(u, v, 0) = JointCon(u, w, 0) = [•], 
JointCon(u, v, 1) = f(a, [•], b), JointCon(u, w, 1) is undefined, JointCon(u, v, 2) = 
f(a, g([•], c), b), and JointCon(u, v, 3) is undefined. 
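
Definition 3.1 can be sketched in Python as follows. This is an illustration under assumptions: terms are nested tuples (symbol, child_1, ..., child_m), the name joint_con and the constant HOLE are hypothetical, None encodes "undefined", and the case of equal arguments with l > 0 (which the definition leaves ambiguous) is treated as undefined here. The term w_hyp below is a made-up stand-in chosen for this sketch, not the term w of the example.

```python
HOLE = ("[.]",)  # hypothetical encoding of the empty context


def joint_con(u, v, l):
    """Return JointCon(u, v, l) as a context, or None when it is undefined."""
    if l == 0:
        return HOLE
    if u[0] != v[0] or len(u) != len(v):
        return None
    # Indexes of the arguments in which u and v differ.
    diff = [i for i in range(1, len(u)) if u[i] != v[i]]
    if len(diff) != 1:
        return None
    i = diff[0]
    inner = joint_con(u[i], v[i], l - 1)
    if inner is None:
        return None
    return u[:i] + (inner,) + u[i + 1:]


a, b, c = ("a",), ("b",), ("c",)
u = ("f", a, ("g", ("h", a, a), c), b)
v = ("f", a, ("g", ("h", b, b), c), b)
w_hyp = ("f", b, ("g", ("h", a, a), c), a)  # hypothetical: differs from u in two arguments
```

For u and v the function reproduces the values listed for JointCon(u, v, l) with l = 0, 1, 2, 3, and it returns None for joint_con(u, w_hyp, 1) because two arguments differ at the root.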

Lemma 3.3. Let s,t,u,v be terms with u ^ v. Let a be a solution of {F{s) = 
u,F{t) = v}. Then a{F) = JointCon(u, w, |hp(a(F))|). 

Proof. We prove the claim by induction on $|hp(\sigma(F))|$. If $|hp(\sigma(F))|$ is 0, then 
$\sigma(F)$ is $[\cdot]$, which coincides with $JointCon(u, v, |hp(\sigma(F))|)$. Now, suppose that 
$|hp(\sigma(F))|$ is $l + 1$ for some natural number $l$. This implies that $\sigma(F)$ is of the form 
$f(w_1, \ldots, w_{i-1}, C[\cdot], w_{i+1}, \ldots, w_m)$ for some function symbol $f$ and some $i \in [1, m]$. 
Since $\sigma$ is a solution of $\{F(s) = u, F(t) = v\}$, then $u$ and $v$ are of the form 
$f(u_1, \ldots, u_m)$ and $f(v_1, \ldots, v_m)$, respectively. For the same reason, $w_j = u_j = v_j$ 
for all $j \neq i$, and moreover, $\sigma(C[s]) = u_i$ and $\sigma(C[t]) = v_i$. Since $u \neq v$ holds, 
we also have $u_i \neq v_i$. Consider a new context variable $F'$ and the extension 
of $\sigma$ as $\sigma(F') = C[\cdot]$. Then, $\sigma$ is also a solution of $\{F'(s) = u_i, F'(t) = v_i\}$. 
Note that $|hp(\sigma(F'))|$ is $l$, which is smaller than $|hp(\sigma(F))|$. By induction 
hypothesis, $\sigma(F') = JointCon(u_i, v_i, |hp(\sigma(F'))|)$. Hence, we conclude 

$\sigma(F) = f(w_1, \ldots, w_{i-1}, C[\cdot], w_{i+1}, \ldots, w_m) = f(w_1, \ldots, w_{i-1}, \sigma(F'), w_{i+1}, \ldots, w_m) = 
f(w_1, \ldots, w_{i-1}, JointCon(u_i, v_i, |hp(\sigma(F'))|), w_{i+1}, \ldots, w_m) = 
JointCon(u, v, |hp(\sigma(F))|)$. □ 
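
Lemma 3.3 can be checked on a small instance. The sketch below assumes the same illustrative tuple encoding of terms (with the hypothetical marker `HOLE`); `plug`, `hole_depth` and the concrete context `sigma_F` are names chosen for this example only, and `s`, `t` already stand for the ground instances $\sigma(s)$ and $\sigma(t)$.

```python
HOLE = ("[.]",)  # hypothetical encoding of the empty context


def joint_con(u, v, l):
    """JointCon(u, v, l) for two different terms, or None if undefined."""
    if l == 0:
        return HOLE
    if u[0] != v[0] or len(u) != len(v):
        return None
    diff = [i for i in range(1, len(u)) if u[i] != v[i]]
    if len(diff) != 1:
        return None
    i = diff[0]
    inner = joint_con(u[i], v[i], l - 1)
    return None if inner is None else u[:i] + (inner,) + u[i + 1:]


def plug(ctx, t):
    """Replace the (single) hole of ctx by the term t."""
    if ctx == HOLE:
        return t
    return (ctx[0],) + tuple(plug(arg, t) for arg in ctx[1:])


def hole_depth(ctx):
    """Length of the hole position |hp(ctx)|, or None if ctx has no hole."""
    if ctx == HOLE:
        return 0
    for arg in ctx[1:]:
        d = hole_depth(arg)
        if d is not None:
            return d + 1
    return None


a, b, c = ("a",), ("b",), ("c",)
sigma_F = ("f", a, ("g", HOLE, c), b)   # an assumed solution for F
s, t = ("h", a, a), ("h", b, b)         # ground instances of s and t
u, v = plug(sigma_F, s), plug(sigma_F, t)

# Knowing only u, v and |hp(sigma(F))| is enough to recover sigma(F).
recovered = joint_con(u, v, hole_depth(sigma_F))
```

Here `recovered` coincides with `sigma_F`, as the lemma states for $u \neq v$.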

3.3 The Intuition Behind the Algorithm 

The algorithm is presented as a set of non-deterministic rules, since this is easier to 
explain. When we reason about its complexity, we argue about the determinized 
version that computes all guessing possibilities. 

As already mentioned, we cannot directly guess a context of the right-hand side 
for every context variable, since there may be exponentially many contexts. In 
spite of this fact, we show that, by making adequate use of the cases where the 
joint context can be inferred, the number of possibilities for each guess can be 
drastically reduced. This fact allows us to use this approach also for the case when 
terms are represented with dags. 

After some standard applications of simplification and first-order variable elimination, 
we can assume that any match-equation in the set $\Delta$ is of the form $F(s) = t$, 
for some context variable $F$. Now, our goal is to remove one context variable by 
performing a guess, where the overall number of possibilities remains polynomial. 

Suppose first that $\Delta$ contains two equations of the form $F(s_1) = t_1$ and $F(s_2) = 
t_2$ with $t_1 \neq t_2$. Then we can infer the context as in the last subsection. However, 
we still need the length of the hole position of $\sigma(F)$, for a possible solution $\sigma$. But 
this length can be guessed from $[0, \min(height(t_1), height(t_2))]$, which is linear in 
the input size. 

Another situation is when $\Delta$ is of the form $\{F(s_1) = t, \ldots, F(s_n) = t\} \cup \Delta'$ 
for some term $t$ and $F$ does not occur elsewhere. In this case, a solution $\sigma$ for 
$\Delta$ necessarily satisfies that $\sigma(F)$ is a certain context $C$ such that $t$ is of the form 
$C[t']$ for some subterm $t'$ of $t$. Although there are exponentially many ways of 
choosing $C$, any of them can be used. Hence, we only have to look for $t'$, which can 
be guessed among only a linear number of possibilities. Then the problem can be 
reduced to $\{s_1 = t', \ldots, s_n = t'\} \cup \Delta'$. Note that the variable $F$ does not appear 
any more. 
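
The gap between the number of candidate subterms and the number of candidate contexts can be seen on a small family of terms. The sketch below is an illustration under assumptions that are not part of the paper: terms are nested tuples, and `full_term(n)` builds the hypothetical term $t_n = f(t_{n-1}, t_{n-1})$ with $t_0 = a$, whose dag has $n + 1$ nodes.

```python
def full_term(n):
    """Build t_n = f(t_{n-1}, t_{n-1}) with t_0 = a (shared structure in memory)."""
    t = ("a",)
    for _ in range(n):
        t = ("f", t, t)
    return t


def distinct_subterms(t, seen=None):
    """Set of distinct subterms of t, i.e. the candidates for t'."""
    seen = set() if seen is None else seen
    if t not in seen:
        seen.add(t)
        for arg in t[1:]:
            distinct_subterms(arg, seen)
    return seen


def count_occurrences(t, target):
    """Number of positions of target in t, i.e. number of contexts C with C[target] = t."""
    if t == target:
        return 1
    return sum(count_occurrences(arg, target) for arg in t[1:])


t10 = full_term(10)
candidates_for_subterm = len(distinct_subterms(t10))
candidates_for_context = count_occurrences(t10, ("a",))
```

For this family the subterm $t'$ is chosen among $n + 1$ candidates, while the number of contexts $C$ with $C[a] = t_n$ is $2^n$.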

Now, suppose that some context variable has an occurrence at some non-root 
position in some term occurring in $\Delta$. A particular case occurs when there is an 
equation $F(s) = t$ in $\Delta$ such that a subterm of $s$ is of the form $F(s')$, i.e. the context 
variable $F$ appears twice, at the root, and at some other position. Any possible 
solution $\sigma$ satisfies that either $\sigma(F)$ is the empty context $[\cdot]$, which can be decided 
with a guess, or else $\sigma(F(s'))$ equals a proper subterm $t'$ of $t$. In the latter case, 
the pair of equations $\{F(s) = t, F(s') = t'\}$ with $t \neq t'$ allows us to proceed again 
by inferring the context, as in the first case. 

If none of the previous cases hold, then there exist equations $F_1(s_1) = t_1, F_2(s_2) = 
t_2, \ldots, F_n(s_n) = t_n$ in $\Delta$, where $F_1$ occurs in $s_2$, $F_2$ occurs in $s_3$, and so on, 
and $F_n$ occurs in $s_1$. In this sequence there is a maximal height term, say $t_1$. 
Thus, $height(t_1) \geq height(t_2)$. Note that $s_2$ contains a subterm of the form 
$F_1(s_2')$. Then, similarly as above, either $\sigma(F_2) = [\cdot]$ or we can use the equations 
$F_1(s_1) = t_1, F_1(s_2') = t_2'$, with $t_2'$ chosen from the proper subdags of $t_2$, to infer 
$\sigma(F_1)$. 

With this approach each one of the $k$ context variables is instantiated by a guess 
among a polynomial number of possibilities. Hence, at this point we can anticipate 
that the final cost of the algorithm will be exponential in $k$, which is a 
constant of the problem. However, we also need to choose a representation for dags 
that allows us to efficiently instantiate both first-order and context variables. This is 
done in the next section. 
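
The cost estimate above can be written as a simple product bound. The symbols $N_i$, $p$ and $n$ are introduced here only for illustration: $N_i$ is the number of alternatives explored when the $i$-th context variable is eliminated, $n$ is the input size, and $p$ is an assumed polynomial bounding every $N_i$.

```latex
% Number of branches explored by the determinized version,
% assuming each of the k eliminations has at most p(n) alternatives.
\[
  \#\text{branches} \;\le\; \prod_{i=1}^{k} N_i \;\le\; p(n)^{k}
\]
```

For fixed $k$ this is polynomial in $n$, which is the sense in which the cost is exponential only in $k$.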

3.4 Dag Representation of the k-CMD Algorithm 

Before presenting our algorithm for the $k$-CMD problem in detail, it is necessary 
to define how we represent dags and how our algorithm deals with such a representation. 
As stated in Definition 2.5, dags can be represented as a DAG, which is a 
particular case of an STG, i.e. an STG which does not have context non-terminals. 
For reasons that will be made clear soon we encode dags using this representation. 

Definition 3.4. An instance of the $k$-context-matching problem with dags is 
a pair $\langle \Delta, G \rangle$ where the STG $G$ is a dag and $\Delta$ is a set of equations $\{A_{s_1} = 
A_{t_1}, \ldots, A_{s_n} = A_{t_n}\}$, where each $A_{s_i}$ and each $A_{t_i}$ is a term non-terminal of $G$, and 
each $w_{A_{t_i}}$ is ground. The question is to compute a substitution $\sigma$ (the solution) for 
the variables such that $\sigma(w_{A_{s_i}}) = w_{A_{t_i}}$ for every equation $i$ in $\Delta$. 

During the execution of the $k$-CMD algorithm, the equations are processed, and 
$G$ is transformed in order to represent the partial solution at each step. More 
concretely, first-order variables are converted into term non-terminals, and context 
variables are converted into context non-terminals, whose generated terms and 
contexts represent substitutions of a partial solution. By variables we mean the 
variables of the problem and by function symbols we mean the terminals of the 
grammar which are not variables although, initially, all of them are terminals of 
the grammar. The initial $G$ has no context non-terminals, and it may incorporate 
them in order to represent that the context variables have been instantiated. 

Our algorithm just instantiates variables by transforming them into non-terminals 
and adding rules for them, but does not change original rules. This 
ensures that right-hand sides of equations always represent subterms of an original 
$w_{G,A_{t_i}}$. Hence, although context variables are created during the execution, 
right-hand sides are always represented by a subset of the initial $G$, which continues 
being a dag according to Definition 2.5. 
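
The effect of instantiating a first-order variable by adding a rule can be sketched as follows. The encoding is an assumption made for this example only: a grammar is a Python dict from non-terminal names to right-hand sides `(head, child_1, ..., child_m)`, any symbol without a rule is a terminal, and `expand` is a hypothetical helper computing the generated term.

```python
def expand(rules, sym):
    """Term generated by sym; symbols without a rule are terminals of arity 0."""
    if sym not in rules:
        return (sym,)                      # constant or uninstantiated variable
    head, *args = rules[sym]
    if head in rules and not args:         # rule of the form A -> A1
        return expand(rules, head)
    return (head,) + tuple(expand(rules, arg) for arg in args)


# Hypothetical dag: S generates f(x, g(a)), where x is a first-order variable.
rules = {
    "S": ("f", "X", "T"),
    "X": ("x",),
    "T": ("g", "A0"),
    "A0": ("a",),
}
before = expand(rules, "S")

# Instantiate x with the term generated by T: x becomes a non-terminal with rule x -> T.
original_rules = dict(rules)
rules["x"] = ("T",)
after = expand(rules, "S")
```

Only one rule is added; every original rule, and in particular the part of the grammar describing the ground right-hand side `T`, is left untouched.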

Using STGs for describing dags, instead of just talking about dags understood 
as directed acyclic graphs, has several advantages. First, we do not have to think 
about nodes and arrows. STGs are more syntactic and it is easier and clearer to 
add or remove rules to/from an STG than to talk about redirecting arrows, newly 
inserted nodes, etc. Second, the formalism of STGs is an improvement in clarity 
and simplicity with respect to the visual concept of solved form for representing 
partial and final solutions. At the end of the execution, the obtained substitution 
for a first-order variable $x$ will be $w_x$, i.e. this variable will be a term non-terminal, 
and its generated term will be the substitution computed for it. Analogously, 
a context variable $F$ will be transformed into a context non-terminal, and the 
substitution computed for it will be $w_F$. Third, analyzing the size increase of the 
representation due to variable instantiation is much simpler: adding a rule $F \to \alpha$ 
for a context variable $F$ and transforming $F$ into a context non-terminal is easy 
to analyze, whereas replacing each node in the dag labeled with $F$ by new nodes 
representing its substitution is a more complicated operation. On the other hand, 
this representation has the disadvantage that the set of equations is not enough by 
itself, but needs the STG. For this reason, our algorithm needs to use the rules of $G$ 
and perform some replacements of non-terminals by their corresponding definition. 

There is a case where our algorithm would have to guess a partial solution from an 
exponential number of possibilities. This happens when we have equations $F(s_1) = 
t, \ldots, F(s_n) = t$, and the context variable $F$ does not appear elsewhere. In this 
case, the only important information to be kept is which subterm $t'$ of $t$ has to be 
selected in order to generate the equations $s_1 = t', \ldots, s_n = t'$. The solution for $F$ 
might be any context $C$ such that $Ct' = t$, that is, the hole position of the solution 
of $F$ is any path from the root of $t$ to an occurrence of $t'$. We do not want to guess 
$C$ from an exponential number of possibilities. Unfortunately, this exponentially 
large set of contexts cannot be represented by $G$. For this reason, in the algorithm 
we have a third component, apart from the set of equations $\Delta$ and the STG $G$, 
representing the possible choices for the variables $F$ of this kind. This component 
is a set of expressions of the form $F \in Contexts(A, A')$, representing that $F$ can 
be replaced by any context $C$ such that $C w_{A'} = w_A$. Hence, our algorithm deals 
with triples $\langle \Delta, G, \Gamma \rangle$ where $\Gamma$ is the set containing this kind of expressions. 
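
An expression $F \in Contexts(A, A')$ leaves the concrete context open, but one witness can be read off the dag when needed. The sketch below is only an illustration under assumptions: the grammar is a dict from non-terminal names to right-hand sides `(head, child_1, ..., child_m)`, the dag is optimally compressed (so equal terms have equal names), and `some_hole_position` is a hypothetical helper, not part of the algorithm in the paper.

```python
def some_hole_position(rules, top, target, dead=None):
    """Return one path (list of argument indices) from top to target, or None.

    Non-terminals already known not to reach target are remembered in `dead`,
    so each node of the dag is explored at most once.
    """
    dead = set() if dead is None else dead
    if top == target:
        return []
    if top in dead:
        return None
    for i, child in enumerate(rules[top][1:], start=1):
        path = some_hole_position(rules, child, target, dead)
        if path is not None:
            return [i] + path
    dead.add(top)
    return None


# Hypothetical dag for f(f(a, a), f(a, a)).
rules = {
    "T2": ("f", "T1", "T1"),
    "T1": ("f", "T0", "T0"),
    "T0": ("a",),
    "Other": ("b",),
}
position = some_hole_position(rules, "T2", "T0")
```

Any of the four positions of $a$ would be a valid hole position here; the search simply returns the first one it finds.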

As a last ingredient, we need to adapt the operation JointCon, presented in 
Section 3.2, to our representation. 

Definition 3.5. Let $G$ be an STG and $A, B$ be two term non-terminals of $G$ such 
that $w_A \neq w_B$ and $restriction(G, \{A, B\})$ is a DAG representing ground terms. 
Let $l$ be a natural number. 

Then, $JointCG(G, A, B, l)$ is defined as an extension of $G$ recursively as follows. 
$JointCG(G, A, B, 0)$ contains $G$ plus the rule $C \to [\cdot]$, where $C$ is 
a new context non-terminal. For the case of $JointCG(G, A, B, l + 1)$, if the 
rules of $A$ and $B$ are of the form $A \to f(A_1, \ldots, A_{i-1}, A_i, A_{i+1}, \ldots, A_n)$ and 
$B \to f(A_1, \ldots, A_{i-1}, B_i, A_{i+1}, \ldots, A_n)$, for some $i$ satisfying $w_{A_i} \neq w_{B_i}$, 
then $JointCG(G, A, B, l + 1)$ contains $JointCG(G, A_i, B_i, l)$, which has a context 
non-terminal $C'$ generating $JointCon(w_{A_i}, w_{B_i}, l)$, plus the rule $C \to 
f(A_1, \ldots, A_{i-1}, C', A_{i+1}, \ldots, A_n)$, where $C$ is a new context non-terminal. In any 
other case, $JointCG(G, A, B, l + 1)$ is undefined. 

Lemma 3.6. Let $G$ be an STG and $A, B$ be two term non-terminals of $G$ such 
that $w_A \neq w_B$ and $restriction(G, \{A, B\})$ is a DAG representing ground terms. 
Let $l$ be a natural number. Assume also that $restriction(G, \{A, B\})$ is compressed 
optimally, i.e. equal terms are represented by the same term non-terminal. 

Then, $JointCG(G, A, B, l)$ adds at most $depth(G)$ new context non-terminals 
to $G$, and has one symbol generating $JointCon(w_A, w_B, l)$. Moreover, all the 
added context non-terminals $C$ have rules which are of the form $C \to [\cdot]$ or 
$C \to f(A_1, \ldots, A_{i-1}, C', A_{i+1}, \ldots, A_n)$, where the terminal $f$ is necessarily a function 
symbol, i.e. it is not a variable. The time complexity of this construction is 
$O(depth(G))$. 

Definition 3.7. Let $G$ be an STG and $A, B$ be two term non-terminals of $G$ such 
that $w_A \neq w_B$ and $restriction(G, \{A, B\})$ is a DAG representing ground terms. 
Let $F$ be a context variable which is a terminal of arity 1 of $G$. Let $l$ be a natural 
number. Assume also that $restriction(G, \{A, B\})$ is compressed optimally, i.e. 
equal terms are represented by the same term non-terminal. 

Then, $JointCGF(G, F, A, B, l)$ is an STG obtained from $JointCG(G, A, B, l)$, 
which has a context non-terminal $C$ not occurring in $G$ and generating the context 
$JointCon(w_A, w_B, l)$, by transforming $F$ into a context non-terminal, and replacing 
the non-terminal $C$ by $F$ everywhere. 
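
Definitions 3.5 and 3.7 can be sketched together on a dict-based grammar. Everything in the encoding is an assumption made for illustration: rules map non-terminal names to `(head, child_1, ..., child_m)`, the dag is optimally compressed (so comparing names replaces comparing terms), fresh context non-terminals are named `F_1`, `F_2`, ... and are assumed not to clash with existing names, and `joint_cgf`, `expand` and `plug` are hypothetical helpers.

```python
HOLE = ("[.]",)  # hypothetical encoding of the empty context


def plug(ctx, t):
    """Replace the single hole of ctx by t."""
    if ctx == HOLE:
        return t
    return (ctx[0],) + tuple(plug(arg, t) for arg in ctx[1:])


def expand(rules, sym):
    """Term or context generated by sym; symbols without a rule are terminals."""
    if sym not in rules:
        return (sym,)
    head, *args = rules[sym]
    if head in rules:
        if not args:                       # A -> A1
            return expand(rules, head)
        return plug(expand(rules, head), expand(rules, args[0]))  # A -> C A1
    return (head,) + tuple(expand(rules, arg) for arg in args)


def joint_cgf(rules, f_var, a, b, l):
    """Return a copy of rules extended as in JointCGF(G, F, A, B, l), or None if undefined."""
    path = []
    for _ in range(l):
        ra, rb = rules[a], rules[b]
        if ra[0] != rb[0] or len(ra) != len(rb):
            return None
        diff = [i for i in range(1, len(ra)) if ra[i] != rb[i]]
        if len(diff) != 1:
            return None
        i = diff[0]
        path.append((ra, i))
        a, b = ra[i], rb[i]
    # Top context non-terminal is F itself; the others get fresh names.
    names = [f_var] + [f"{f_var}_{k}" for k in range(1, l + 1)]
    new_rules = dict(rules)
    for depth, (ra, i) in enumerate(path):
        new_rules[names[depth]] = ra[:i] + (names[depth + 1],) + ra[i + 1:]
    new_rules[names[l]] = HOLE             # rule C -> [.]
    return new_rules


# Dag for u = f(a, g(h(a,a), c), b) and v = f(a, g(h(b,b), c), b), plus F(Haa) and F(Hbb).
rules = {
    "Na": ("a",), "Nb": ("b",), "Nc": ("c",),
    "Haa": ("h", "Na", "Na"), "Hbb": ("h", "Nb", "Nb"),
    "Gu": ("g", "Haa", "Nc"), "Gv": ("g", "Hbb", "Nc"),
    "U": ("f", "Na", "Gu", "Nb"), "V": ("f", "Na", "Gv", "Nb"),
    "L1": ("F", "Haa"), "L2": ("F", "Hbb"),
}
extended = joint_cgf(rules, "F", "U", "V", 2)
```

In this example the construction adds three context non-terminals (`F`, `F_1`, `F_2`), walks down a single path of the dag, and after it `L1` and `L2` generate the same terms as `U` and `V`.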

3.5 Rules of the k-CMD Algorithm 

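Definition 3.8. The $k$-CMD algorithm is presented in Figures 1, 2 and 3 as a 
set of transformation rules which deal with triples $\langle \Delta, G, \Gamma \rangle$, where $\Delta$ is a set of 
equations defined over an STG $G$, where the right-hand sides of equations are 
non-terminals in a DAG representing ground terms, and $\Gamma$ is a set of expressions each one 
representing all solutions for a context variable, as described in the previous section. 
We assume that, initially, equal subterms in the right-hand sides of equations are 
represented by the same term non-terminal, i.e. optimal dag compression is used. 
This will hold during the execution. Given an instance of the problem $\langle \{A_{s_1} = 
A_{t_1}, \ldots, A_{s_n} = A_{t_n}\}, G \rangle$, the starting triple is $\langle \{A_{s_1} = A_{t_1}, \ldots, A_{s_n} = A_{t_n}\}, G, \emptyset \rangle$, 
and the constant $L$ occurring in the rules is $\max_{1 \leq i \leq n}(height(w_{G,A_{t_i}}))$. 

Unfold1: 
$$\frac{\langle \Delta \cup \{A = B\},\; G = (\mathcal{TN} \cup \{A, B\}, \mathcal{CN}, \Sigma, R \cup \{A \to u\}),\; \Gamma \rangle}{\langle \Delta \cup \{u = B\},\; G,\; \Gamma \rangle}$$

Unfold2: 
$$\frac{\langle \Delta \cup \{CA = B\},\; G = (\mathcal{TN} \cup \{A, B\}, \mathcal{CN} \cup \{C\}, \Sigma, R \cup \{C \to f(A_1, \ldots, C_i, \ldots, A_m)\}),\; \Gamma \rangle}{\langle \Delta \cup \{f(A_1, \ldots, C_i A, \ldots, A_m) = B\},\; G,\; \Gamma \rangle}$$

Unfold3: 
$$\frac{\langle \Delta \cup \{CA = B\},\; G = (\mathcal{TN} \cup \{A, B\}, \mathcal{CN} \cup \{C\}, \Sigma, R \cup \{C \to [\cdot]\}),\; \Gamma \rangle}{\langle \Delta \cup \{A = B\},\; G,\; \Gamma \rangle}$$

Fig. 1. Unfold-Rules of the $k$-CMD Algorithm 

Decompose: 
$$\frac{\langle \Delta \cup \{f(u_1, \ldots, u_m) = B\},\; G = (\mathcal{TN} \cup \{B_1, \ldots, B_m\}, \mathcal{CN}, \Sigma \cup \{f\}, R \cup \{B \to f(B_1, \ldots, B_m)\}),\; \Gamma \rangle}{\langle \Delta \cup \{u_1 = B_1, \ldots, u_m = B_m\},\; G,\; \Gamma \rangle}$$
where $f$ is a function symbol ($m = arity(f)$). 

Fail: 
$$\frac{\langle \Delta \cup \{f(u_1, \ldots, u_m) = B\},\; G = (\mathcal{TN} \cup \{B_1, \ldots, B_{m'}\}, \mathcal{CN}, \Sigma \cup \{f, g\}, R \cup \{B \to g(B_1, \ldots, B_{m'})\}),\; \Gamma \rangle}{\bot}$$
where $f \neq g$ are function symbols ($m = arity(f)$, $m' = arity(g)$). 

Elimx: 
$$\frac{\langle \Delta \cup \{x = B\},\; G = (\mathcal{TN} \cup \{B\}, \mathcal{CN}, \Sigma \cup \{x\}, R),\; \Gamma \rangle}{\langle \Delta \cup \{x = B\},\; G' = (\mathcal{TN} \cup \{B, x\}, \mathcal{CN}, \Sigma, R \cup \{x \to B\}),\; \Gamma \rangle}$$
where $x$ is a first-order variable and a terminal. 

Fig. 2. First-Order-Rules of the $k$-CMD Algorithm 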

There are two kinds of choices the algorithm can make. On the one side there are 
the "don't care" selections, which include the strategy stating which rule is applied 
and the selection of the equations involved in the rule application. On the other 
side we have the guesses, which make the algorithm non-deterministic. Those 
correspond to the decisions marked as "guessed" in the conditions of the rules, but 
also to the selection performed when the resulting part of a rule has a disjunction. 

ElimF1: 
$$\frac{\langle \Delta \cup \{F(A_1) = B_1, F(A_2) = B_2\},\; G = (\mathcal{TN} \cup \{A_1, A_2, B_1, B_2\}, \mathcal{CN}, \Sigma \cup \{F\}, R),\; \Gamma \rangle}{\langle \Delta \cup \{F(A_1) = B_1, F(A_2) = B_2\},\; G' = JointCGF(G, F, B_1, B_2, l),\; \Gamma \rangle}$$
where $F$ is a context variable and a terminal and $w_{B_1} \neq w_{B_2}$. $l$ is guessed over 
$[0, L]$ such that $JointCG(G, B_1, B_2, l)$ is defined. 

ElimF2: 
$$\frac{\langle \Delta \cup \{F(A_1) = B, F(A_2) = B, \ldots, F(A_n) = B\},\; G = (\mathcal{TN} \cup \{A_1, \ldots, A_n, B'\}, \mathcal{CN}, \Sigma \cup \{F\}, R),\; \Gamma \rangle}{\langle \Delta \cup \{A_1 = B', A_2 = B', \ldots, A_n = B'\},\; G,\; \Gamma \cup \{F \in Contexts(B, B')\} \rangle}$$
where $F$ is a context variable and a terminal not occurring in the $w_{A_i}$'s, nor in 
$w_u$, for all equations $u = v \in \Delta$. $B'$ is guessed over the term non-terminals of 
$restriction(G, \{B\})$. 

ElimF3: 
$$\frac{\langle \Delta \cup \{F(A) = B\},\; G = (\mathcal{TN} \cup \{A, B, B'\}, \mathcal{CN}, \Sigma \cup \{F\}, R),\; \Gamma \rangle}{\langle \Delta \cup \{F(A) = B\},\; G = (\mathcal{TN} \cup \{A, B, B'\}, \mathcal{CN} \cup \{F\}, \Sigma, R \cup \{F \to [\cdot]\}),\; \Gamma \rangle \;\mid\; \langle \Delta \cup \{F(A) = B\},\; G' = JointCGF(G, F, B, B', l),\; \Gamma \rangle}$$
where $F$ is a context variable and a terminal occurring in $w_A$. The term 
non-terminal $B'$ is guessed over the term non-terminals of $restriction(G, B)$ excluding 
$B$, and $l$ is guessed over $[1, L]$ such that $JointCG(G, B, B', l)$ is defined. 

ElimF4: 
$$\frac{\langle \Delta \cup \{F_1(A_1) = B_1, F_2(A_2) = B_2\},\; G = (\mathcal{TN} \cup \{A_1, A_2, B_1, B_2, B_2'\}, \mathcal{CN}, \Sigma \cup \{F_1, F_2\}, R),\; \Gamma \rangle}{\langle \Delta \cup \{F_1(A_1) = B_1, F_2(A_2) = B_2\},\; G = (\mathcal{TN} \cup \{A_1, A_2, B_1, B_2, B_2'\}, \mathcal{CN} \cup \{F_2\}, \Sigma \cup \{F_1\}, R \cup \{F_2 \to [\cdot]\}),\; \Gamma \rangle \;\mid\; \langle \Delta \cup \{F_1(A_1) = B_1, F_2(A_2) = B_2\},\; G' = JointCGF(G, F_1, B_1, B_2', l),\; \Gamma \rangle}$$
where $F_1 \neq F_2$ are context variables that are terminals, $F_1$ occurs in $w_{A_2}$, 
and $height(w_{B_1}) \geq height(w_{B_2})$. $B_2'$ is guessed over the term non-terminals 
of $restriction(G, \{B_2\})$ excluding $B_2$, and $l$ is guessed over $[0, L]$ such that 
$JointCG(G, B_1, B_2', l)$ is defined. 

Fig. 3. Elim-F-Rules of the $k$-CMD Algorithm 

We divide our set of inference rules into two disjoint subsets. We call the 
first rules unfolding rules (see Fig. 1), since their purpose is to replace the 
non-terminals of $G$ occurring in the equations by their definition in $G$. Hence, these 
rules are related to our grammar-based representation for dags. We refer to the rest 
of the rules as solving rules, since they represent the actual algorithm as described 
in Section 3.3; these are split into the first-order rules (see Fig. 2) and the 
context-variable rules (see Fig. 3). The application of solving rules transforms the 
set of equations into a new set. Depending on the case, more than one rule can be 
applied to a given set of equations. Hence, the inference system represents, in fact, 
a family of algorithms, depending on the strategy for deciding which rule to apply 
and to which subset of equations. As commented before, our initial set of equations 
is of the form $\{A_{s_1} = A_{t_1}, \ldots, A_{s_n} = A_{t_n}\}$. But after applying the transformation 
rules, the form of these equations may change. Nevertheless, at any step of the 
algorithm the current equations are simple, according to the following definition. 

Definition 3.9. Let $G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R)$ be an STG, and let $u = v$ be an equation, 
where $u, v \in T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$. The equation $u = v$ is called simple over $G$ if 
it is of one of the following forms. 

- $A = B$ 
- $CA = B$ 
- $a(A_1, \ldots, A_m) = B$ 
- $f(A_1, \ldots, A_{i-1}, C_i A, A_{i+1}, \ldots, A_m) = B$, 

where $A$ is a term non-terminal of $G$, $B$ is a term non-terminal of $G$ representing 
a ground term, $C$ is a context non-terminal of $G$, and the terms $a(A_1, \ldots, A_m)$ and 
$f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m)$ are right-hand sides of rules of $G$, for a terminal 
$a$, a terminal $f$ which is also a function symbol, term non-terminals $A_1, \ldots, A_m$, 
and a context non-terminal $C_i$. Variables can only occur as some $a$. 

The following lemma shows that no rule of the form $C \to C_1 C_2$ occurs in the 
$k$-CMD algorithm. 

Lemma 3.10. Let $\langle \Delta, G, \Gamma \rangle$ be a triple obtained by our algorithm at any point of 
the execution. Then, the rules of $G$ are of the following forms. 

- $A \to A_1$ 
- $A \to a(A_1, \ldots, A_m)$ 
- $A \to C A_1$ 
- $C \to f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m)$ 
- $C \to [\cdot]$ 

where $A, A_1, A_2, \ldots, A_m$ are term non-terminals of $G$, $C, C_i$ are context 
non-terminals of $G$, $a$ is a terminal of $G$, and $f$ is a terminal of $G$, which is also a 
function symbol. 

Proof. We prove the lemma by induction on the number of applied inference 
rules. For the base case, note that the lemma holds for the STG $G_0$ given as 
input since, by Definitions 2.5 and 3.4, all the rules in $G_0$ are of the form $A \to 
a(A_1, \ldots, A_m)$ for some term non-terminals $A_1, \ldots, A_m$ and a terminal $a$ of $G_0$. 

For the induction case, let $\langle \Delta', G', \Gamma' \rangle$ be the triple from which $\langle \Delta, G, \Gamma \rangle$ was 
obtained by an inference rule application. By induction hypothesis $\langle \Delta', G', \Gamma' \rangle$ 
satisfies the conditions of the lemma. We distinguish cases according to the inference 
rule applied to $\langle \Delta', G', \Gamma' \rangle$ in order to show that the rules in $G$ follow the conditions 
of the lemma. Note that for the inference rules that do not modify the STG (unfolding 
rules, Decompose, Fail, and ElimF2), this is straightforward. Otherwise, 
if Elimx was the applied rule, $x$ became a term non-terminal and a rule of the 
form $x \to A$ was added to $G'$ for some terminal $x$ representing a first-order variable 
and term non-terminal $A$. Note that the added rule satisfies the conditions of the 
lemma. Finally, if the applied rule was either ElimF1, ElimF3, or ElimF4 then 
either $G'$ was extended by the JointCon construction or a rule $F \to [\cdot]$ was added 
to $G'$, for some context variable $F$. By Lemma 3.6, in both cases all the added 
rules satisfy the condition of the lemma. □ 

Lemma 3.11. Let $\langle \Delta, G, \Gamma \rangle$ be the triple obtained by our algorithm at a point of 
the execution. Then, the set $\Delta$ consists of simple equations over $G$. 

Proof. Since for the triple given as input $\langle \Delta_0, G_0, \Gamma_0 = \emptyset \rangle$ all the equations in 
$\Delta_0$ are of the form $A_s = A_t$ for some term non-terminals $A_s, A_t$ in $G_0$, the statement 
of the lemma holds in this case. Hence, for proving this lemma it suffices 
to check that after an inference step where $\langle \Delta, G, \Gamma \rangle$ was obtained from a triple 
$\langle \Delta', G', \Gamma' \rangle$, each new equation in $\Delta$ is simple over $G$. Checking this is an easy task 
for rules Fail, Elimx, ElimF1, ElimF2, ElimF3, ElimF4, Unfold2, and Unfold3, 
since the newly produced equations are explicitly defined. For Decompose 
the result follows by induction hypothesis. Finally, the equations produced by 
the application of Unfold1 are of the form $u = B$, where $B$ is a term non-terminal 
and $u$ corresponds to a right-hand side of a rule in $G'$ and hence, they satisfy the 
condition to be simple over $G$ due to Lemma 3.10. □ 

Before proving soundness, completeness and termination of our inference system 
we should define a notion of solution of the triples the $k$-CMD algorithm deals with. 

Definition 3.12. A solution of $\langle \Delta, G, \Gamma \rangle$ is a substitution $\sigma$ such that $\sigma(w_{G,u}) = 
w_{G,v}$ for each equation $u = v$ in $\Delta$, $\sigma(w_{G,x}) = \sigma(x)$ for each first-order variable $x$, 
and $\sigma(w_{G,F}) = \sigma(F)$ for each context variable $F$. Furthermore, for each context 
variable $F$ such that $(F \in Contexts(A_1, A_2)) \in \Gamma$, where $A_1$ and $A_2$ are term 
non-terminals of $G$, it holds that $\sigma(F) w_{G,A_2} = w_{G,A_1}$. 

Let $\langle \Delta, G, \Gamma \rangle$ be a triple generated by our algorithm at any point of the execution. 
Note that some of the variables may have been isolated and, hence, the STG $G$ was 
extended in order to represent the corresponding instantiations. As stated in the 
previous definition, a solution of $\langle \Delta, G, \Gamma \rangle$ has to be consistent with these extensions. 
The following lemma, together with the definition of a solution $\sigma$ of $\langle \Delta, G, \Gamma \rangle$, states 
that our representation for partial solutions by extending the grammar is correct 
in the sense that the same term is obtained by applying a solution to the term 
generated by $G$ before and after such an extension. It will be helpful when proving 
soundness and completeness. 

Lemma 3.13. Let $G = (\mathcal{TN}, \mathcal{CN}, \Sigma \cup \{V\}, R)$ be an STG obtained at any point 
of the execution of the $k$-CMD algorithm. Let $V$ be a terminal of $G$ representing 
either a first-order or a context variable. Let $G'$ be the STG obtained from $G$ by 
converting $V$ into a non-terminal of the grammar and adding some new rules and 
non-terminals such that $V$ generates a certain term or context $w_{G',V}$. Let $\sigma$ be a 
substitution such that $\sigma(V) = \sigma(w_{G',V})$. Let $t$ be a term in $T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma \cup \{V\})$ 
or a context in $C(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma \cup \{V\})$. Then, $\sigma(w_{G,t}) = \sigma(w_{G',t})$. 

Proof. The proof is an easy induction on the size of $t$ and the number of rule 
applications to derive $w_{G,t}$. □ 

Lemma 3.14. The set of rules is sound. 

Proof. Let $\langle \Delta', G', \Gamma' \rangle$ be the triple obtained by our algorithm by applying an 
inference step on $\langle \Delta, G, \Gamma \rangle$. By inspecting the rules, we can check that every solution 
$\sigma$ of $\langle \Delta', G', \Gamma' \rangle$ is also a solution of $\langle \Delta, G, \Gamma \rangle$. We distinguish cases depending on 
which rule was applied for obtaining $\langle \Delta', G', \Gamma' \rangle$ from $\langle \Delta, G, \Gamma \rangle$. 

Note that the rules Elimx, ElimF1, ElimF3 and ElimF4 instantiate either a 
first-order or a context variable $V$. Therefore, if one of those rules was the rule 
applied to $\langle \Delta, G, \Gamma \rangle$ then $G'$ was obtained from $G$ by transforming $V$ into a 
non-terminal of the STG and adding some non-terminals and their corresponding rules 
such that $V$ generates $w_{G',V}$. By Definition 3.12, for being a solution of $\langle \Delta', G', \Gamma' \rangle$, 
$\sigma$ satisfies $\sigma(V) = \sigma(w_{G',V})$. Hence, $G$ and $G'$ satisfy the conditions of Lemma 3.13 
and we can conclude $\sigma(w_{G,t}) = \sigma(w_{G',t})$ for every term $t$ in $T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$, where 
$G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R)$. It follows that $\sigma(x) = \sigma(w_{G,x})$ for every first-order variable 
$x$, and $\sigma(F) = \sigma(w_{G,F})$ for every context variable $F$. Moreover, since none of these 
rules changed either the set $\Delta$ or $\Gamma$, $\sigma$ is also a solution for $\langle \Delta, G, \Gamma \rangle$. 

Suppose the rule applied is ElimF2. In this case, $G' = G$ but both sets $\Delta$ and 
$\Gamma$ are changed. Concretely, a set of equations of the form $\{F(A_1) = B, F(A_2) = 
B, \ldots, F(A_n) = B\}$ of $\Delta$ is replaced by a set of equations of the form $\{A_1 = 
B', A_2 = B', \ldots, A_n = B'\}$ to obtain $\Delta'$ and the restriction $F \in Contexts(B, B')$ 
was added to $\Gamma$ to obtain $\Gamma'$. By Definition 3.12, since $\sigma$ is a solution of $\langle \Delta', G', \Gamma' \rangle$, 
it holds $\sigma(w_{G',A_i}) = w_{G',B'}$ for each $i \in [1, n]$ and $\sigma(F) w_{G',B'} = w_{G',B}$. Since $G = 
G'$ and $\Delta - \{F(A_i) = B \mid i \in [1, n]\} = \Delta' - \{A_i = B' \mid i \in [1, n]\}$, it suffices to prove 
that $\sigma(w_{G,F(A_i)}) = w_{G,B}$ for $i \in [1, n]$ to show that $\sigma$ is also a solution of $\langle \Delta, G, \Gamma \rangle$. 
Since $G = G'$ and $\sigma(w_{G',A_i}) = w_{G',B'}$ then $\sigma(w_{G,A_i}) = w_{G,B'}$ holds. Furthermore, 
it holds that $\sigma(w_{G,F(A_i)}) = \sigma(F(w_{G,A_i})) = \sigma(F)\sigma(w_{G,A_i}) = \sigma(F) w_{G,B'} = 
\sigma(F) w_{G',B'} = w_{G',B} = w_{G,B}$. Hence, we proved that $\sigma(w_{G,F(A_i)}) = w_{G,B}$ and 
thus $\sigma$ is also a solution of $\langle \Delta, G, \Gamma \rangle$. 

For rule Fail, it is obvious that the assumption of a solution $\sigma$ for the resulting 
triple $\langle \Delta', G', \Gamma' \rangle$ cannot be satisfied. 

Suppose the rule applied is Decompose. Then, $G' = G$, $\Gamma = \Gamma'$ and an equation 
$f(u_1, \ldots, u_m) = B$ in $\Delta$, where $B \to f(B_1, \ldots, B_m)$ is the rule in $G$, is 
replaced by the equations $u_1 = B_1, \ldots, u_m = B_m$ to obtain $\Delta'$. Hence, it suffices 
to prove that $\sigma(w_{G,f(u_1,\ldots,u_m)}) = w_{G,f(B_1,\ldots,B_m)}$ in order to show that $\sigma$ is also 
a solution for $\langle \Delta, G, \Gamma \rangle$. Since $\sigma$ is a solution of $\langle \Delta', G', \Gamma' \rangle$ it holds $\sigma(w_{G',u_1}) = 
w_{G',B_1}, \ldots, \sigma(w_{G',u_m}) = w_{G',B_m}$. Thus, $\sigma(w_{G,f(u_1,\ldots,u_m)}) = \sigma(w_{G',f(u_1,\ldots,u_m)}) = 
f(\sigma(w_{G',u_1}), \ldots, \sigma(w_{G',u_m})) = f(w_{G',B_1}, \ldots, w_{G',B_m}) = f(w_{G,B_1}, \ldots, w_{G,B_m}) = 
w_{G,f(B_1,\ldots,B_m)}$. 

In the case where the rule applied is an unfolding rule, note that these rules just 
replace non-terminals of $G$ by their definition in $G$. Hence, since $w_N = w_\alpha$ for each 
non-terminal $N$ with a rule $N \to \alpha \in G$, every solution of $\langle \Delta', G', \Gamma' \rangle$ is also a 
solution of $\langle \Delta, G, \Gamma \rangle$. □ 

The following lemma is an adaptation of Lemma 3.3 to our STG-based representation 
for dags, which will be helpful when proving completeness. 

Lemma 3.15. Let $G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R)$ be an STG. Let $u_1, u_2$ be terms in 
$T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$. Let $A_1, A_2, B_1$ and $B_2$ be term non-terminals of $G$ such that 
$w_{G,B_1} \neq w_{G,B_2}$ and both $w_{G,B_1}$ and $w_{G,B_2}$ are ground. Let $restriction(G, \{B_1, B_2\})$ 
be compressed optimally as a DAG. Let $\sigma$ be a solution of $\langle \{F(u_1) = B_1, F(u_2) = 
B_2\}, G, \Gamma = \emptyset \rangle$ where the context variable $F$ is a terminal of $G$. Let $G' = 
JointCGF(G, F, B_1, B_2, |hp(\sigma(F))|)$. Then, $\sigma(F) = w_{G',F}$. 

Proof. This lemma directly follows from Lemma 3.3 and Definition 3.7. □ 

Lemma 3.16. Every rule is complete, i.e. for every solution $\sigma$ of $\langle \Delta, G, \Gamma \rangle$, and 
for every rule application, there is a result $\langle \Delta', G', \Gamma' \rangle$ such that $\sigma$ is also a solution 
of $\langle \Delta', G', \Gamma' \rangle$. Moreover, any maximal sequence of rule applications computes a 
representation of all solutions, by gathering all guesses and alternatives in the rules. 

Proof. Let $\sigma$ be a solution for some triple $\langle \Delta, G, \Gamma \rangle$ obtained by our algorithm. 
It suffices to show that after applying any applicable rule to $\langle \Delta, G, \Gamma \rangle$, one of 
the resulting triples $\langle \Delta', G', \Gamma' \rangle$ among the possible guesses also has $\sigma$ as solution. 
We distinguish cases depending on which inference step was applied for obtaining 
$\langle \Delta', G', \Gamma' \rangle$ from $\langle \Delta, G, \Gamma \rangle$. We state explicitly here $G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R)$ because 
it will be necessary, in some cases, to refer to the set of terms $T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$. 

Assume the applied rule is Decompose. Then, $G' = G$, $\Gamma = \Gamma'$ and an equation 
$f(u_1, \ldots, u_m) = B$ in $\Delta$ with rule $B \to f(B_1, \ldots, B_m)$ is replaced by the equations 
$u_1 = B_1, \ldots, u_m = B_m$ to obtain $\Delta'$, where each $u_i \in T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$. Hence, 
it suffices to prove $\sigma(w_{G',u_1}) = w_{G',B_1}, \ldots, \sigma(w_{G',u_m}) = w_{G',B_m}$ in order to show 
that $\sigma$ is also a solution for $\langle \Delta', G', \Gamma' \rangle$. Since $\sigma$ is a solution of $\langle \Delta, G, \Gamma \rangle$, it holds 
that $\sigma(w_{G,f(u_1,\ldots,u_m)}) = w_{G,f(B_1,\ldots,B_m)}$, which implies $\sigma(f(w_{G,u_1}, \ldots, w_{G,u_m})) = 
f(w_{G,B_1}, \ldots, w_{G,B_m})$, and hence $\sigma(w_{G,u_1}) = w_{G,B_1}, \ldots, \sigma(w_{G,u_m}) = w_{G,B_m}$. 
Finally, since $G = G'$, $\sigma$ is also a solution of $\langle \Delta', G', \Gamma' \rangle$. 

Assume the applied rule is Elimx. Then $\Gamma = \Gamma'$ and $\Delta = \Delta'$. For a concrete 
equation $x = B \in \Delta$, $G$ was extended to $G'$ by converting $x$ into a term non-terminal 
and adding the rule $x \to B$. Since $\sigma$ is a solution of $\langle \Delta, G, \Gamma \rangle$ and $x$ is a terminal 
of $G$, $w_{G,x} = x$ and $\sigma(x) = w_{G,B}$ holds. Furthermore, $w_{G',x} = w_{G',B} = w_{G,B}$ since 
$B$ is the definition of $x$ in $G'$ and none of the rules of $G$ were changed to obtain $G'$. 
Hence, $\sigma(x) = w_{G,B} = w_{G',B} = w_{G',x} = \sigma(w_{G',x})$, where the last equality holds 
because $w_{G',x}$ is ground. Thus, we can apply Lemma 3.13 and claim that, for every 
term $t$ in $T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$, $\sigma(w_{G,t}) = \sigma(w_{G',t})$. Hence, since $\Gamma = \Gamma'$ and $\Delta = \Delta'$, 
$\sigma$ is also a solution for $\langle \Delta', G', \Gamma' \rangle$. 

For the Fail rule it is clear that the assumption on the existence of a solution 
cannot be satisfied. 

Suppose that the applied rule is ElimF1. In this case, $\Delta = \Delta'$, $\Gamma = \Gamma'$ and 
$G$ was extended to $G'$ by converting the terminal $F$, which is a context variable, 
into a context non-terminal. Some rules and non-terminals were added such that 
$F$ generates a ground context $w_{G',F}$. We first show that $\sigma(F) = \sigma(w_{G',F})$ holds 
for one of the possible guesses when applying this rule. 

Since $|hp(\sigma(F))|$ is smaller than or equal to $L$ ($|hp(\sigma(F))| \in [0, L]$) we can assume 
that $l$ is guessed as $|hp(\sigma(F))|$ in the rule application. Then, by the conditions for 
this rule application, there are equations of the form $F(A_1) = B_1, F(A_2) = B_2$ in 
$\Delta$ such that $w_{B_1} \neq w_{B_2}$. Furthermore, both $w_{B_1}$ and $w_{B_2}$ are ground and $G'$ is 
constructed as $JointCGF(G, F, B_1, B_2, |hp(\sigma(F))|)$. Hence, by Lemma 3.15, $\sigma(F) = 
w_{G',F}$. Moreover, we can apply Lemma 3.13 and conclude that $\sigma(w_{G,t}) = \sigma(w_{G',t})$ 
for every term $t \in T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$. Thus, $\sigma$ is a solution of $\langle \Delta', G', \Gamma' \rangle$. 

Suppose now that the applied rule is ElimF2. In this case, $G = G'$, and 
some equations of the form $F(A_1) = B, F(A_2) = B, \ldots, F(A_n) = B$ of $\Delta$ such 
that $F$ does not occur in $w_u$ for any other equation $u = v$ in $\Delta$ were replaced 
by the equations $A_1 = B', \ldots, A_n = B'$ to obtain $\Delta'$. Moreover, the restriction 
$F \in Contexts(B, B')$ was added to $\Gamma$ to obtain $\Gamma'$. Since $\sigma$ is a solution 
of $\langle \Delta, G, \Gamma \rangle$, it is also a solution of $\{F(A_1) = B, F(A_2) = B, \ldots, F(A_n) = 
B\}$. Hence, there exists some subterm $w_{G,B'}$ of $w_{G,B}$ satisfying $\sigma(w_{G,A_1}) = 
w_{G,B'}, \ldots, \sigma(w_{G,A_n}) = w_{G,B'}$, which corresponds to $w_{G,B}|_{hp(\sigma(F))}$. In our 
representation choosing a subterm of $w_{G,B}$ is equivalent to choosing one of the 
term non-terminals of $restriction(G, \{B\})$. Thus, we can consider the case 
where $B'$ is the term non-terminal guessed in the rule application. In this case 
$\sigma(w_{G',A_1}) = w_{G',B'}, \ldots, \sigma(w_{G',A_n}) = w_{G',B'}$ holds since $G = G'$. Therefore, $\sigma$ is 
also a solution for $\langle \Delta', G', \Gamma' \rangle$. With respect to $\sigma(F)$, it satisfies $\sigma(F) w_{G,B'} = w_{G,B}$, 
which is exactly the condition added to $\Gamma$ by the rule application in order to keep 
a representation of all possible instantiations for the context variable $F$. 

Suppose that the applied rule is ElimF3. In this case, $\Delta = \Delta'$, $\Gamma = \Gamma'$ and 
$G$ was extended to $G'$ by converting a terminal $F$ representing a context variable 
into a context non-terminal. Some rules and non-terminals were added such that 
$F$ generates the ground context $w_{G',F}$. We first show that $\sigma(F) = \sigma(w_{G',F})$ holds for 
one of the possible guesses when applying this rule. 

By the condition of this rule application, $F(A) = B$ is an equation in $\Delta$ where 
$F$ occurs in $w_A$. The case $\sigma(F) = [\cdot]$ is covered by the first alternative of the 
rule. Now assume that $\sigma(F) \neq [\cdot]$. Since $F$ occurs in $w_{G,A}$, there exists a proper 
subterm of $w_{G,F(A)}$ (a subterm of $w_{G,A}$) of the form $F(u)$ for some term $u \in 
T(\Sigma)$. Since $\sigma(F(w_{G,A})) = w_{G,B}$ holds and $\sigma(F) \neq [\cdot]$, there exists a proper 
subterm $w_{G,B'}$ of $w_{G,B}$ such that $\sigma(F(u)) = w_{G,B'}$ and, for the same reason as 
in the previous case, $B'$ is a term non-terminal in $restriction(G, B)$ excluding 
$B$. We consider the case where the term non-terminal $B'$ is guessed by the rule 
application and $l$ is guessed as $|hp(\sigma(F))|$. When these two guesses are done, $G'$ 
is constructed as $JointCGF(G, F, B, B', |hp(\sigma(F))|)$. Furthermore, we know that $\sigma$ 
satisfies $\sigma(F(u)) = w_{G,B'}$ and $\sigma(w_{G,F(A)}) = w_{G,B}$. Moreover, $w_{G,B'}$ and $w_{G,B}$ are 
ground, and $w_{B'} \neq w_B$, since $w_{B'}$ is a proper subdag of $w_B$. Hence, we can apply 
Lemma 3.15 and conclude $\sigma(F) = w_{G',F}$. As before, we can apply Lemma 3.13 and 
conclude that $\sigma$ is a solution of $\langle \Delta', G', \Gamma' \rangle$. 

Suppose that the applied rule is ElimF4. In this case, $\Delta = \Delta'$, $\Gamma = \Gamma'$ and $G$
was extended to $G'$ by either converting a terminal $F_2$ or a terminal $F_1 \neq F_2$, each
of them representing a context variable, into a context non-terminal. Each of these
cases corresponds to one of the two alternatives of the rule. In the first case the
rule $F_2 \to [\cdot]$ was added, such that $F_2$ generates $w_{G',F_2} = [\cdot]$, the empty context. In
the second case some rules and non-terminals were added, such that $F_1$ generates
the ground context $w_{G',F_1}$. We first show that either $\sigma(F_2) = \sigma(w_{G',F_2})$, in the
former case, or $\sigma(F_1) = \sigma(w_{G',F_1})$, in the latter case.

By the condition of the application of ElimF4, there is a pair of equations in $\Delta$
of the form $F_1(A_1) = B_1$ and $F_2(A_2) = B_2$. Furthermore, $F_1$ occurs in $w_{G,A_2}$,
and $\mathrm{height}(w_{G,B_1}) \geq \mathrm{height}(w_{G,B_2})$. The case $\sigma(F_2) = [\cdot]$ is covered by the first
alternative of the rule, and it is obvious that $\sigma(F_2) = \sigma(w_{G',F_2}) = [\cdot]$ holds in
this case. Now assume that $\sigma(F_2) \neq [\cdot]$. Since $F_1$ occurs in $w_{G,A_2}$, there exists
a proper subterm of $w_{G,F_2(A_2)}$ (a subterm of $w_{G,A_2}$) of the form $F_1(u)$, for some
$u \in T(\mathcal{TN} \cup \mathcal{CN} \cup \Sigma)$. Moreover, since $\sigma(w_{G,F_2(A_2)}) = w_{G,B_2}$ holds, and $\sigma(F_2) \neq [\cdot]$,
there exists a proper subterm $w_{G,B'_2}$ of $w_{G,B_2}$ such that $\sigma(F_1(u)) = w_{G,B'_2}$ and, for
the same reason as in the previous case, $B'_2$ is represented by a term non-terminal in
restriction($G$, $B_2$) excluding $B_2$, since the subterm is proper. We consider the case
where the term non-terminal $B'_2$ is guessed by the rule application and $l$ is guessed as
$|\mathrm{hp}(\sigma(F_1))|$. Hence, $G'$ is constructed as JointCGF($G, F_1, B_1, B'_2, |\mathrm{hp}(\sigma(F_1))|$). We
know that $\sigma$ has to satisfy $\sigma(w_{F_1(A_1)}) = w_{G,B_1}$ and $\sigma(F_1(u)) = w_{G,B'_2}$. Moreover,
$w_{G,B_1}$ and $w_{G,B'_2}$ are ground, and $w_{G,B_1} \neq w_{G,B'_2}$ holds, since $\mathrm{height}(w_{G,B_1}) \geq
\mathrm{height}(w_{G,B_2})$ and $w_{G,B'_2}$ is a proper subterm of $w_{G,B_2}$. Hence, by Lemma 3.15,
$\sigma(F_1) = w_{G',F_1}$. As before, we can apply Lemma 3.13 and conclude that $\sigma$ is a
solution of $(\Delta', G', \Gamma')$.

Finally, assume that the applied rule is an unfolding rule. Note that the
application of an unfolding rule modifies neither the set of restrictions $\Gamma$ nor the grammar
$G$. $\Delta'$ is obtained by replacing the left-hand side $u$ of an equation $u = v$ in $\Delta$ by $u'$,
yielding a new equation $u' = v$. Since $G = G'$, it holds that $w_{G,u} = w_{G',u}$. Moreover, since
$\sigma$ is a solution of $(\Delta, G, \Gamma)$, it satisfies $\sigma(w_{G,u}) = w_{G,v}$. Hence, it suffices to check
that $w_{G,u} = w_{G,u'}$ when $u = v$ is replaced by $u' = v$ by the rule application. But
this follows directly from the fact that these replacements are due to the application of a rule of
$G$, and we are done. □

Proposition 3.17. For every initial triple $(\Delta_0, G_0, \Gamma_0 = \emptyset)$, the
determinized algorithm will compute a complete set of solved triples
$(\Delta_1, G_1, \Gamma_1), \ldots, (\Delta_n, G_n, \Gamma_n)$, such that $\sigma$ is a solution of $(\Delta_0, G_0, \Gamma_0 = \emptyset)$
iff it is a solution of some $(\Delta_i, G_i, \Gamma_i)$, for $i \in [1, n]$.

Proof. Termination holds; see the argument on the complexity in the next
section. Since we have proved soundness and completeness, it remains to show that
if some intermediate $(\Delta, G, \Gamma)$ is not solved, then an inference rule can be applied.
The k-CMD algorithm represents instantiations of variables by transforming them
into non-terminals of the STG. Hence, the fact that a triple $(\Delta, G, \Gamma)$ is not solved
means that there are occurrences of terminals of $G$ representing first-order variables
or context variables in $\Delta$.

Assume that no inference rule can be applied. We will deduce the form of the
equations $u = B \in \Delta$ under this assumption until we reach a contradiction. Let
$A, A_1, \ldots, A_m, B, B_1, \ldots, B_{m'}$ be term non-terminals of $G$, let $C_i$, $C$ be context
non-terminals of $G$, and let $f$, $g$ be terminals of $G$ representing function symbols of arity
$m$ and $m'$, respectively.

Note that $u$ cannot be of the form $f(A_1, \ldots, A_m)$ nor
$f(A_1, \ldots, C_iA, A_{i+1}, \ldots, A_m)$ since, as $v$ is a non-terminal $B$ with rule
$B \to g(B_1, \ldots, B_{m'})$, either Decompose (if $f = g$), or Fail (if $f \neq g$) would be
applicable. Hence, by Lemma 3.11, at this point $u$ can be of the forms $x$, $F(A)$ or $CA$,
where $x$ is a terminal of the grammar representing a first-order variable and $F$ is
a terminal of the grammar representing a context variable. This implies that, if
$u = x$ then Elimx is applicable, and if $u = CA$ then Unfold2 is applicable. Thus,
$u$ can only be of the form $F(A)$. Hence, since we argued about an arbitrarily chosen
equation $u = v \in \Delta$, every equation $i$ in $\Delta$ is of the form $F_i(A_i) = B_i$. Moreover,
since neither ElimF1, ElimF2 nor ElimF3 can be applied, for every terminal $F_i$
representing a context variable occurring in $\Delta$ there exists an equation $F_j(A_j) = B_j$
in $\Delta$, such that $F_i$ is different from the terminal $F_j$, and $F_i$ occurs in $w_{A_j}$. Since the
set $\Delta$ is finite, there exist equations $F_1(A_1) = B_1, F_2(A_2) = B_2, \ldots, F_n(A_n) = B_n$
with $n \geq 2$ satisfying that $F_1$ occurs in $w_{A_2}$, $F_2$ occurs in $w_{A_3}$, $\ldots$, $F_{n-1}$ occurs
in $w_{A_n}$, and $F_n$ occurs in $w_{A_1}$, and where the $F_i$'s are pairwise different. Let $i$ be
such that $w_{B_i}$ has maximal height among the $w_{B_1}, \ldots, w_{B_n}$, say $i = 1$. Hence, we
may take the equation $F_1(A_1) = B_1$ and the equation $F_2(A_2) = B_2$ and apply rule
ElimF4, which is a contradiction. □

The following example shows that the Decompose rule may be executed an exponential
number of times if multiple insertions of the same equation in one inference
sequence are not prohibited. Hence, our algorithm must keep track of already treated
equations in order to avoid this blow-up.

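Example 3.18. Let $G$ be an STG defined by the following set of rules:

\[
\begin{aligned}
\{\, & B_1 \to f(B_2, B_2),\ B_2 \to f(B_3, B_3),\ \ldots,\ B_i \to f(B_{i+1}, B_{i+1}),\ \ldots,\ B_{n-1} \to f(B_n, B_n),\ B_n \to a, \\
& A_1 \to f(A_2, A'_2),\ A_2 \to f(A_3, A'_3),\ \ldots,\ A_i \to f(A_{i+1}, A'_{i+1}),\ \ldots,\ A_{n-1} \to f(A_n, A'_n),\ A_n \to a, \\
& A'_1 \to f(A_2, A'_2),\ A'_2 \to f(A_3, A'_3),\ \ldots,\ A'_{n-1} \to f(A_n, A'_n),\ A'_n \to a, \\
& A \to f(A_1, A'_1),\ B \to f(B_1, B_1) \,\}
\end{aligned}
\]

We now consider a decomposition sequence for the equation $A = B$ that decomposes
depth-first. Note that $G$ satisfies the assumption on an optimally compressed
representation of restriction($G$, $\{B\}$).

\[
\begin{aligned}
\{A = B\} &\Rightarrow \{f(A_1, A'_1) = B\} \\
&\Rightarrow \{A_1 = B_1,\ A'_1 = B_1\} \\
&\Rightarrow \{f(A_2, A'_2) = B_1,\ A'_1 = B_1\} \\
&\Rightarrow \{A_2 = B_2,\ A'_2 = B_2,\ A'_1 = B_1\} \\
&\Rightarrow \{f(A_3, A'_3) = B_2,\ A'_2 = B_2,\ A'_1 = B_1\} \\
&\Rightarrow \{A_3 = B_3,\ A'_3 = B_3,\ A'_2 = B_2,\ A'_1 = B_1\} \\
&\Rightarrow^{*} \{A_i = B_i,\ A'_i = B_i,\ A'_{i-1} = B_{i-1},\ \ldots,\ A'_1 = B_1\} \\
&\Rightarrow^{*} \{A_n = B_n,\ A'_n = B_n,\ A'_{n-1} = B_{n-1},\ \ldots,\ A'_1 = B_1\} \\
&\Rightarrow \{a = a,\ a = a,\ A'_{n-1} = B_{n-1},\ \ldots,\ A'_1 = B_1\}
\end{aligned}
\]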

Hence, the depth-first strategy may lead to an exponentially long sequence of 
decompositions. 
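
The effect of tracking already treated equations can be checked on a small instance. The following is a minimal sketch, not the k-CMD implementation: the grammar encoding, `build_grammar` and `count_decompositions` are hypothetical helpers introduced only to count Decompose steps on the grammar of Example 3.18, with and without a set of equations that have already been treated.

```python
def build_grammar(n):
    """Rules of Example 3.18; a right-hand side is 'a' or ('f', left, right)."""
    rules = {"A": ("f", "A1", "A'1"), "B": ("f", "B1", "B1")}
    for i in range(1, n):
        rules[f"B{i}"] = ("f", f"B{i + 1}", f"B{i + 1}")
        rules[f"A{i}"] = ("f", f"A{i + 1}", f"A'{i + 1}")
        rules[f"A'{i}"] = ("f", f"A{i + 1}", f"A'{i + 1}")
    rules[f"B{n}"] = rules[f"A{n}"] = rules[f"A'{n}"] = "a"
    return rules


def count_decompositions(rules, lhs, rhs, seen=None):
    """Count Decompose steps for the equation lhs = rhs, depth-first.

    If `seen` is a set, an equation that was already treated is skipped;
    if it is None, repeated equations are decomposed again.
    """
    if seen is not None:
        if (lhs, rhs) in seen:
            return 0
        seen.add((lhs, rhs))
    left, right = rules[lhs], rules[rhs]
    if left == "a" or right == "a":
        return 0  # constants: nothing left to decompose
    steps = 1
    for l_child, r_child in zip(left[1:], right[1:]):
        steps += count_decompositions(rules, l_child, r_child, seen)
    return steps


grammar = build_grammar(10)
naive = count_decompositions(grammar, "A", "B")            # 2**10 - 1 steps
tracked = count_decompositions(grammar, "A", "B", set())   # 2*10 - 1 steps
```

In this encoding the untracked run performs $2^n - 1$ decompositions, while the tracked run performs $2n - 1$, one per distinct non-constant equation.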

3.6 Complexity of the k-CMD Algorithm 

Let $(\Delta = \{A_{s_1} = A_{t_1}, \ldots, A_{s_n} = A_{t_n}\}, G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R), \Gamma = \emptyset)$ be the initial
configuration of the execution, and let $(\Delta', G', \Gamma')$ be the last one. Recall that
$L = \max_{1 \leq i \leq n}(\mathrm{height}(w_{G,A_{t_i}}))$ and $k$ denotes the number of different context
variables in the problem. Let $V$ denote the set of first-order variables.

Our inference rules may add new non-terminals and their corresponding rules to
the grammar. Concretely, at most $|V|$ rules of the form $x \to A$ and at most $kL$ rules
of the forms $C \to f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m)$ and $C \to [\cdot]$ are added to $G$
during an execution. Therefore, at any point of the execution, any right-hand side
of a rule of the current STG $G'$ of the form $f(A_1, \ldots, A_m)$ is in fact a right-hand
side of a rule of the initial $G$.

We count the number of different equations $u = v$ that may appear during the
execution. Our equations are simple with respect to the final $G'$ by Lemma 3.11.
Thus, $u$ is either of the form $A$ ($|\mathcal{TN}| + |V|$ possibilities), or $f(A_1, \ldots, A_m)$ (an
original right-hand side of a rule, thus $|\mathcal{TN}|$ possibilities), or $CA$ ($kL(|\mathcal{TN}| + |V|)$
possibilities), or $f(A_1, \ldots, A_{i-1}, C_iA, A_{i+1}, \ldots, A_m)$ ($kL(|\mathcal{TN}| + |V|)$ possibilities).

On the other hand, $v$ can only be a term non-terminal $A$, an original term
non-terminal, thus $|\mathcal{TN}|$ possibilities.

Therefore, the total number of different equations in a branch of non-deterministic
execution is $O(\mathrm{depth}(G)|G|^2)$. Assuming we avoid repetition of equations, this will
also be the maximum number of execution steps. Each of those steps chooses an
equation and applies an inference rule to it. The corresponding operations can
be performed in logarithmic time with the adequate data structures. Thus, the
non-deterministic execution time is $O(\mathrm{depth}(G)|G|^2\log(|G|))$.

During the execution, $k$ guesses over $L$ possibilities are made. Therefore,
the execution time of the deterministic version of this algorithm is
$O((\mathrm{depth}(G))^{k+1}|G|^2\log(|G|))$.
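
The bound can be traced step by step. The following worked count only restates the figures given above, under the assumptions that $k$ is fixed, that $|\mathcal{TN}| + |V| \leq |G|$, and that $L \leq \mathrm{depth}(G)$ for the input dag:

```latex
\begin{align*}
\#\{u\} &\le (|\mathcal{TN}| + |V|) + |\mathcal{TN}| + 2kL\,(|\mathcal{TN}| + |V|) = O(kL\,|G|), \\
\#\{v\} &\le |\mathcal{TN}| \le |G|, \\
\#\{u = v\} &\le \#\{u\} \cdot \#\{v\} = O(kL\,|G|^2) = O(\mathrm{depth}(G)\,|G|^2), \\
\text{time per branch} &= O(\mathrm{depth}(G)\,|G|^2 \log|G|), \\
\text{time over all } L^k \text{ guesses} &= O(L^k\,\mathrm{depth}(G)\,|G|^2 \log|G|)
  = O(\mathrm{depth}(G)^{k+1}\,|G|^2 \log|G|).
\end{align*}
```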

Theorem 3.19. Computing all solutions (and hence deciding solvability) of an
instance of the k-context matching with dags problem can be done in polynomial
time. The worst case running time is $O((\mathrm{depth}(G))^{k+1}|G|^2\log(|G|))$, where $k$ is
the number of context variables and $|G|$ is the size of the input dag.

4. GRAMMAR CONSTRUCTIONS 

For the description and analysis of efficient algorithms for context matching, first- 
order unification and first-order matching of STG-compressed terms we need several 
extension constructions of STGs. These algorithms have as suboperations finding 
differences in two terms and performing instantiations of context-variables and first- 
order variables. The difficulties are induced by the task of performing all the 
required operations on the compressed representation of terms. 

In [Busatto et al. 2005] it was shown how to succinctly represent the preorder
traversal word of a term generated by an STG using an SCFG. We reproduce this
construction in Subsection 4.1 to compute an SCFG $\mathrm{Pre}_G$ with non-terminals $\mathcal{P}_s$
and $\mathcal{P}_t$ generating pre($s$) and pre($t$), respectively. We also need to compute, given
$\mathrm{Pre}_G$, the smallest index $k$ in which pre($s$) and pre($t$) differ. In Subsection 4.2
we show how to perform this task efficiently. Our approach is based on a recent
result on compressed string processing [Lifshits 2007]. As commented above, $k$
corresponds to a unique position $p \in \mathrm{Pos}(s) \cap \mathrm{Pos}(t)$. In Subsection 4.3, we present
the procedure, given $G$ and $k$, to extend $G$ such that a new non-terminal generates
$t|_p$. Avoiding the explicit calculation of $p$ refines the approach presented in previous
work on STG-compressed first-order unification [Gascon et al. 2009] in order to
obtain a faster algorithm.

We also need to apply substitutions once a variable is isolated. Performing a
replacement of a first-order variable $x$ by a term $u$ is easily representable with
STGs by simply transforming $x$ into a non-terminal $x$ of the grammar and adding
rules such that $x$ generates $u$. However, since successive replacements of variables
by subterms modify the initial terms, we have to show that this does not produce an 
exponential increase of the size of the grammar, since its depth may be doubled after 
each of these operations. To this end, we develop a notion of restricted depth, and 
show that its value is preserved during the execution, and that the size increase at 
each step can be bounded by this restricted depth, which is shown in Subsection 4.4. 

4.1 Computing the preorder traversal of a term. 

In [Busatto et al. 2005] it is shown how to construct, from a given STG $G$, an SCFG
$\mathrm{Pre}_G$ representing the preorder traversals of the terms and contexts generated by
$G$. We reproduce that construction here, presented in Figure 4 as a set of rules
indicating, for each term non-terminal $A$ and its rule $A \to \alpha$ of $G$, which rule
$\mathcal{P}_A \to \alpha'$ of $\mathrm{Pre}_G$ is required in order to make a non-terminal $\mathcal{P}_A$ of $\mathrm{Pre}_G$ satisfy
$w_{\mathrm{Pre}_G,\mathcal{P}_A} = \mathrm{pre}(w_{G,A})$. To this end, for each context non-terminal $C$ of $G$ we also
need non-terminals of $\mathrm{Pre}_G$ generating the preorder traversal to the left of the hole
($\mathcal{L}_C$), and the preorder traversal to the right of the hole ($\mathcal{R}_C$).

\[
\begin{array}{l}
A \to f(A_1, \ldots, A_m) \\
A \to C_1A_2 \\
A \to A_1 \\
C \to C_1C_2 \\
C \to f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m) \\
C \to [\cdot]
\end{array}
\]

Fig. 4. Generating the Preorder Traversal

It is straightforward to verify by induction on the depth of $G$ that, for every term
non-terminal $A$ of $G$, the corresponding newly generated non-terminal $\mathcal{P}_A$ of $\mathrm{Pre}_G$
generates pre($w_A$).

Lemma 4.1. Let $G$ be an STG. An SCFG $\mathrm{Pre}_G$ of size $O(|G|)$ can be constructed
in time $O(|G|)$ such that, for each non-terminal $N$ of $G$, there exists a non-terminal
$\mathcal{P}_N$ in $\mathrm{Pre}_G$ satisfying $w_{\mathrm{Pre}_G,\mathcal{P}_N} = \mathrm{pre}(w_{G,N})$.
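
One way to realise a grammar with the properties of Lemma 4.1 is sketched below. The rule shapes are an assumption chosen so that $\mathcal{P}_A$ generates the preorder word and $\mathcal{L}_C$, $\mathcal{R}_C$ generate the parts to the left and right of the hole; they are not copied from [Busatto et al. 2005]. The tuple encoding of STG rules and the helper `expand` are hypothetical and serve only to check the result on a small grammar.

```python
def preorder_scfg(stg):
    """Build one SCFG rule per term rule and two per context rule.

    Assumed STG encoding (names are strings):
      ("fun", f, [A1, ..., Am])        A -> f(A1, ..., Am)
      ("app", C1, A2)                  A -> C1 A2
      ("alias", A1)                    A -> A1
      ("comp", C1, C2)                 C -> C1 C2
      ("cfun", f, left, Ci, right)     C -> f(left..., Ci, right...)
      ("hole",)                        C -> [.]
    SCFG non-terminals are tuples ("P", A), ("L", C), ("R", C).
    """
    pre = {}
    for name, rule in stg.items():
        kind = rule[0]
        if kind == "fun":
            pre[("P", name)] = [rule[1]] + [("P", a) for a in rule[2]]
        elif kind == "app":
            pre[("P", name)] = [("L", rule[1]), ("P", rule[2]), ("R", rule[1])]
        elif kind == "alias":
            pre[("P", name)] = [("P", rule[1])]
        elif kind == "comp":
            pre[("L", name)] = [("L", rule[1]), ("L", rule[2])]
            pre[("R", name)] = [("R", rule[2]), ("R", rule[1])]
        elif kind == "cfun":
            _, f, left, inner, right = rule
            pre[("L", name)] = [f] + [("P", a) for a in left] + [("L", inner)]
            pre[("R", name)] = [("R", inner)] + [("P", a) for a in right]
        elif kind == "hole":
            pre[("L", name)] = []
            pre[("R", name)] = []
    return pre


def expand(scfg, symbol):
    """Expand a symbol to its word; only sensible for small examples."""
    if isinstance(symbol, str):
        return symbol  # terminal
    return "".join(expand(scfg, s) for s in scfg[symbol])


stg = {
    "Aa": ("fun", "a", []),
    "Ab": ("fun", "b", []),
    "C0": ("hole",),
    "C1": ("cfun", "g", ["Aa"], "C0", ["Ab"]),  # g(a, [.], b)
    "C2": ("comp", "C1", "C1"),                 # g(a, g(a, [.], b), b)
    "T": ("app", "C2", "Aa"),                   # g(a, g(a, a, b), b)
}
pre = preorder_scfg(stg)
word = expand(pre, ("P", "T"))
```

Each STG rule yields a constant number of SCFG rules whose total length is linear in the length of the original rule, which is consistent with the $O(|G|)$ size stated in the lemma.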

4.2 Computing the first different position of two words. 

Given two non-terminals $p_1$ and $p_2$ of an SCFG $P$, we want to find the smallest
index $k$ such that $w_{p_1}[k]$ and $w_{p_2}[k]$ are different. In order to solve this problem,
a linear search over the generated words $w_{p_1}$ and $w_{p_2}$ is not a good idea, since
their sizes may be exponentially big with respect to the size of $P$. Hence, one may
be tempted to apply a binary search since prefixes are efficiently computable with
SCFGs and equality is checkable in time $O(|P|^3)$, which would lead to $O(|P|^4)$
time complexity. However, we will use more specific information from Lifshits'
work [Lifshits 2007] to obtain $O(|P|^3)$ time complexity.

Lemma 4.2. [Lifshits 2007] Let $G$ be an SCFG. Then a data structure can be
computed in time $O(|G|^3)$ which allows us to answer the following question in time
$O(|G|)$: given two non-terminals $N_1$ and $N_2$ of $G$ and an integer value $k$, does $w_{N_1}$
occur in $w_{N_2}$ at position $k$?

Thus, assume that the pre-computation of Lemma 4.2 has been done (in time
$O(|P|^3)$), and hence we can answer whether a given $w_{p_1}$ occurs in a given $w_{p_2}$ at a
certain position in time $O(|P|)$.

For finding the first different position between $p_1$ and $p_2$, we can assume $|w_{p_1}| \leq
|w_{p_2}|$ without loss of generality. Moreover, we also assume $w_{p_1} \neq w_{p_2}[1..|w_{p_1}|]$, i.e.
$w_{p_1}$ is not a prefix of $w_{p_2}$. Note that this condition is necessary for the existence
of a different position between $w_{p_1}$ and $w_{p_2}$, and that this will be the case when $p_1$
and $p_2$ generate the preorder traversals of different trees. Finally, we can assume
that $P$ is in Chomsky Normal Form. Note that, if this was not the case, we can
force this assumption with a linear time and space transformation.

\[
\mathrm{index}(p_1, p_2, k', P) =
\begin{cases}
1 & \text{if } |w_{p_1}| = 1 \\
\mathrm{index}(p_{11}, p_2, k', P) & \text{if } (p_1 \to p_{11}p_{12}) \in P \wedge w_{p_{11}} \neq w_{p_2}[(k'+1) \ldots (k'+|w_{p_{11}}|)] \\
|w_{p_{11}}| + \mathrm{index}(p_{12}, p_2, k' + |w_{p_{11}}|, P) & \text{if } (p_1 \to p_{11}p_{12}) \in P \wedge w_{p_{11}} = w_{p_2}[(k'+1) \ldots (k'+|w_{p_{11}}|)]
\end{cases}
\]

Fig. 5. Algorithm for the Index of the First Difference

We generalize our problem to the following question: given two non-terminals $p_1$
and $p_2$ of $P$ and an integer $k'$ satisfying $k' + |w_{p_1}| \leq |w_{p_2}|$ and $w_{p_1} \neq w_{p_2}[(k'+1)..(k'+|w_{p_1}|)]$,
which is the smallest $k \geq 1$ such that $w_{p_1}[k]$ is different from
$w_{p_2}[k'+k]$? (Note that we recover the original question by fixing $k' = 0$.)

This generalization is solved efficiently by the recursive algorithm given in
Figure 5, as can be shown inductively on the depth of $p_1$. By Lemma 4.2, each call takes
time $O(|P|)$, and at most depth($P$) calls are executed. Thus, the most expensive
part of computing the first different position of $w_{p_1}$ and $w_{p_2}$ is the pre-computation
given by Lemma 4.2, that is, $O(|P|^3)$.

Lemma 4.3. Let $P$ be an SCFG of size $n$, and let $p_1, p_2$ be non-terminals of $P$
such that $w_{p_1} \neq w_{p_2}$. The first position $k$ where $w_{p_1}$ and $w_{p_2}$ differ is computable
in time $O(|P|^3)$.
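
The recursion behind Lemma 4.3 can be sketched as follows. This is an illustration under a strong simplification: the occurrence test of Lemma 4.2 is replaced by a hypothetical stand-in `occurs_at` that expands the words explicitly, so the sketch shows the control flow only and not the stated running time. The grammar encoding (a terminal character or a pair of non-terminals per rule) and `make_index` are assumptions made for this example.

```python
from functools import lru_cache


def make_index(rules):
    """rules: non-terminal -> single character, or (left, right) in CNF."""

    @lru_cache(maxsize=None)
    def word(p):
        rule = rules[p]
        return rule if isinstance(rule, str) else word(rule[0]) + word(rule[1])

    def occurs_at(p1, p2, offset):
        # Stand-in for the oracle of Lemma 4.2 (explicit expansion):
        # does w_{p1} equal w_{p2}[offset+1 .. offset+|w_{p1}|] ?
        return word(p2).startswith(word(p1), offset)

    def index(p1, p2, offset=0):
        """Smallest k >= 1 with w_{p1}[k] != w_{p2}[offset + k].

        Precondition: w_{p1} differs from the corresponding factor of w_{p2}.
        """
        if isinstance(rules[p1], str):  # |w_{p1}| = 1
            return 1
        p11, p12 = rules[p1]
        if occurs_at(p11, p2, offset):
            # left half matches: the difference lies in the right half
            shift = len(word(p11))
            return shift + index(p12, p2, offset + shift)
        return index(p11, p2, offset)

    return index


rules = {
    "a": "a", "b": "b",
    "X": ("a", "b"),   # ab
    "Y": ("X", "X"),   # abab
    "V": ("a", "a"),   # aa
    "U": ("b", "a"),   # ba
    "S1": ("Y", "X"),  # ababab
    "S2": ("Y", "V"),  # ababaa
    "T": ("U", "Y"),   # baabab
}
index = make_index(rules)
first_diff = index("S1", "S2")
```

Each call descends one level in the derivation of $w_{p_1}$ and issues one occurrence query, matching the count of at most depth($P$) calls used in the analysis.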

4.3 Isolating variables 

As commented in Section 2, the index $k$ from the previous subsection defines a
position $p = \mathrm{iPos}(t, k)$ of a term $t$ generated by an STG $G$. We show how to
compute, in linear time, an extension of the STG $G$ with a non-terminal generating
$t|_p$. We use the SCFG $\mathrm{Pre}_G$ presented in Definition 4.1.

Definition 4.4. Let $G$ be an STG. Let $N$ be a non-terminal of $G$, and let $k$ be
a natural number satisfying $k \leq |\mathrm{pre}(w_{G,N})|$. We recursively define kExt($G, N, k$)
as an extension of $G$ as follows:

- If $k = 1$ then kExt($G, N, k$) $= G$. In the next cases we assume $k > 1$.

- If $(N \to f(N_1, \ldots, N_{i-1}, N_i, \ldots, N_m)) \in G$ and $1 + |w_{N_1}| + \cdots + |w_{N_{i-1}}| = k' <
k \leq k' + |w_{N_i}|$ then kExt($G, N, k$) $=$ kExt($G, N_i, k - k'$).

- If $(N \to C_1A_2) \in G$ and $k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}|$ then kExt($G, N, k$) includes
kExt($G, C_1, k$), which contains a non-terminal $N'$ generating the subterm of
$w_{G,C_1}$ at position $\mathrm{iPos}(w_{G,C_1}, k)$. If $N'$ is a context non-terminal then
kExt($G, N, k$) additionally contains the rule $A \to N'A_2$, where $A$ is a new term
non-terminal.

- If $(N \to C_1C_2) \in G$ and $k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}|$ then kExt($G, N, k$) includes
kExt($G, C_1, k$), which contains a non-terminal $N'$ generating the subterm of
$w_{G,C_1}$ at position $\mathrm{iPos}(w_{G,C_1}, k)$. If $N'$ is a context non-terminal then
kExt($G, N, k$) additionally contains the rule $C \to N'C_2$, where $C$ is a new context
non-terminal.

- If $(N \to C_1N_2) \in G$ and $k' = |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}| < k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}| + |w_{N_2}|$ then
kExt($G, N, k$) $=$ kExt($G, N_2, k - k'$).

- If $(N \to C_1N_2) \in G$ and $|w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}| + |w_{N_2}| < k$ then kExt($G, N, k$) $=$
kExt($G, C_1, k - |w_{N_2}| + 1$).

- If $(A \to A_1) \in G$ then kExt($G, A, k$) $=$ kExt($G, A_1, k$).

- In any other case kExt($G, N, k$) is undefined.

Lemma 4.5. Let $G$ be an STG. Let $N$ be a non-terminal of $G$, and let $k$ be a natural
number such that $k \leq |\mathrm{pre}(w_{G,N})|$. Then $G$ can be extended to an STG $G'$ in time
$O(|G|)$ with $O(\mathrm{depth}(G))$ new non-terminals such that one of them generates the
subterm of $w_{G,N}$ at position $\mathrm{iPos}(w_{G,N}, k)$.

Proof. The fact that kExt($G, N, k$) is an extension of $G$ satisfying the
statements of the lemma follows by induction on depth($N$), distinguishing cases
according to the definition of kExt($G, N, k$), and applying the definition of iPos from
Section 2. To compute kExt($G, N, k$) in linear time we first build the SCFG $\mathrm{Pre}_G$
generating the preorder traversals of the terms generated by $G$ and pre-compute
the size of the term/word generated by each non-terminal in $G$ and $\mathrm{Pre}_G$. Both
operations can be done in linear time as stated in Lemma 4.1 and Lemma 2.10.
Once these pre-computations are done, kExt($G, N, k$) can be computed by a single
run over the rules of $G$, which leads to the desired time complexity. □
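
For the special case of a dag, i.e. when every rule has the form $N \to f(N_1, \ldots, N_m)$, only the first two cases of Definition 4.4 arise and no new non-terminal is needed. The sketch below illustrates that descent; it is a simplified illustration that deliberately omits the context cases, and the encoding of rules and the name `subterm_nonterminal` are assumptions made for this example.

```python
def subterm_nonterminal(rules, n, k):
    """Return the non-terminal generating the subterm of w_n at preorder index k.

    rules: non-terminal -> (symbol, [child non-terminals]); indices are 1-based.
    Sizes are pre-computed once per non-terminal, so exponentially large
    terms are never expanded.
    """
    size = {}

    def term_size(x):
        if x not in size:
            size[x] = 1 + sum(term_size(c) for c in rules[x][1])
        return size[x]

    if not 1 <= k <= term_size(n):
        raise ValueError("index outside the preorder traversal")
    while k > 1:
        offset = 1  # k' = 1 + sizes of the children already skipped
        for child in rules[n][1]:
            if k <= offset + term_size(child):
                n, k = child, k - offset  # descend as in kExt(G, N_i, k - k')
                break
            offset += term_size(child)
    return n


dag = {
    "B1": ("f", ["B2", "B2"]),
    "B2": ("f", ["B3", "B3"]),
    "B3": ("a", []),
}
# w_{B1} = f(f(a, a), f(a, a)); preorder word: f f a a f a a
found = subterm_nonterminal(dag, "B1", 5)
```

The loop visits one rule per level of the grammar, which mirrors the single run over the rules used in the proof of Lemma 4.5.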

4.4 Application of substitutions and a notion of restricted depth 

Recall that, when working with STGs, we represent the application of a substitution 
on a first-order variable x by transforming x into a term non-terminal and adding 
the necessary rules such that x generates the term to which it is assigned. 

When one or more substitutions of this form are applied, in general the depth
of the non-terminals of G might increase. In order to see that the size increase is
polynomially bounded after several substitution operations when unifying, we need 
a new notion of depth called Vdepth, which does not increase after an application 
of a substitution. It allows us to bound the final size increase of G. The notion 
of Vdepth is similar to the notion of depth, but it is for the non-terminals N 
belonging to a special set V satisfying the following condition. 

Definition 4.6. Let $G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R)$ be an STG, and let $V$ be a subset of
$\mathcal{TN} \cup \Sigma$. We say that $V$ is a $\lambda$-set for $G$ if for each term non-terminal $A$ in $V$, the
rule of $G$ of the form $A \to u$ is a $\lambda$-rule, i.e. $u$ is a term non-terminal.

Definition 4.7. Let $G = (\mathcal{TN}, \mathcal{CN}, \Sigma, R)$ be an STG and let $V$ be a $\lambda$-set for
$G$. For every non-terminal $N$ of $G$, the value $\mathrm{Vdepth}_{G,V}(N)$, denoted also as
$\mathrm{Vdepth}_V(N)$ or $\mathrm{Vdepth}(N)$ when $G$ and/or $V$ are clear from the context, is defined
as follows (recall the convention that $\max(\emptyset) = 0$).

\[
\mathrm{Vdepth}(N) :=
\begin{cases}
0 & \text{for } N \in V \\
1 + \max\{\mathrm{Vdepth}(N') \mid N' \text{ is a non-terminal occurring in } u, \text{ where } N \to u \in G\} & \text{otherwise.}
\end{cases}
\]

The Vdepth of $G$ is the maximum of the Vdepth of its non-terminals.

The idea is that $V$ contains all first-order variables, before and after converting
them into term non-terminals. The following lemma is completely straightforward
from the above definitions, and states that a substitution application does not
modify the Vdepth provided $X \in V$ for the substitution $X \mapsto A$.
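
The definition of Vdepth translates directly into a memoized recursion. The following is a minimal sketch, assuming a hypothetical encoding in which each non-terminal is mapped to the list of non-terminals occurring on the right-hand side of its rule (terminals are left out); the name `vdepth` is not from the paper.

```python
def vdepth(rules, v_set):
    """Vdepth of every non-terminal with respect to the set v_set.

    rules: non-terminal -> list of non-terminals occurring in its rule.
    Members of v_set get value 0; max over an empty set is taken as 0.
    """
    memo = {}

    def go(n):
        if n in v_set:
            return 0
        if n not in memo:
            memo[n] = 1 + max((go(m) for m in rules.get(n, [])), default=0)
        return memo[n]

    return {n: go(n) for n in rules}


# A -> f(B, x),  B -> a,  x -> D (an instantiated variable),  D -> g(E),  E -> a
rules = {"A": ["B", "x"], "B": [], "x": ["D"], "D": ["E"], "E": []}
with_v = vdepth(rules, {"x"})     # x is in V, so what x points to is ignored
plain_depth = vdepth(rules, set())  # with V empty this is the ordinary depth
```

In this example the ordinary depth of `A` grows once `x` is instantiated, whereas its Vdepth stays the same, which is the behaviour the definition is designed to capture.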

Lemma 4.8. Let $G$, $V$ be as in the above definition. Let $X \in V$ be a terminal of
$G$ of arity 0, and let $A$ be a term non-terminal of $G$. Let $G'$ be the STG obtained
from $G$ by transforming $X$ into a term non-terminal and adding the rule $(X \to A)$.
Then, for any non-terminal $N$ of $G$ it holds that $\mathrm{Vdepth}_{G'}(N) = \mathrm{Vdepth}_G(N)$.

We also need the fact that Vdepth does not increase due to the construction of
kExt($G, A, k$) from $G$. However, we first prove a more specific statement.

Lemma 4.9. Let $G$ be an STG, let $C$ be a context non-terminal of $G$, let $V$ be a
$\lambda$-set for $G$, let $k$ be a natural number such that $w_C|_{\mathrm{iPos}(w_C,k)}$ is a context, and let
$G'$ be kExt($G, C, k$).

Then, for every non-terminal $N$ of $G$ it holds that $\mathrm{Vdepth}_G(N) = \mathrm{Vdepth}_{G'}(N)$,
and for every new non-terminal $N'$ in $G'$ and not in $G$, it holds that
$\mathrm{Vdepth}_{G'}(N') \leq \mathrm{Vdepth}_G(C)$. Moreover, the number of new added non-terminals
is bounded by $\mathrm{Vdepth}_G(C)$.

Proof. The identity $\mathrm{Vdepth}_G(N) = \mathrm{Vdepth}_{G'}(N)$ for each non-terminal $N$ of
$G$ is straightforward from the fact that kExt($G, C, k$) does not change the rules
for the non-terminals occurring in $G$. To prove the fact that $\mathrm{Vdepth}_{G'}(N') \leq
\mathrm{Vdepth}_G(C)$ for each new non-terminal $N'$ in $G'$ and not in $G$, plus the fact that
at most $\mathrm{Vdepth}_G(C)$ new non-terminals have been added, we will use induction on
$\mathrm{Vdepth}_G(C)$. The base case ($\mathrm{Vdepth}_G(C) = 1$) trivially holds since, in this case,
the STG $G$ is not modified (note that necessarily $k = 1$). For the induction step
we distinguish cases according to the definition of kExt($G, C, k$):

- Assume that $(C \to f(A_1, \ldots, A_{i-1}, C', \ldots, A_m)) \in G$. Note that, since
$w_C|_{\mathrm{iPos}(w_C,k)}$ is a context, it holds that $1 + |w_{A_1}| + \cdots + |w_{A_{i-1}}| = k' <
k \leq k' + |w_{C'}|$. In this case, kExt($G, C, k$) $=$ kExt($G, C', k - k'$) and, since
$\mathrm{Vdepth}_G(C') < \mathrm{Vdepth}_G(C)$, the lemma directly follows by induction hypothesis.

- Assume that $(C \to C_1C_2) \in G$ and $k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}|$. In this case,
the construction of kExt($G, C, k$) is done by computing kExt($G, C_1, k$) and
adding the rule $C' \to C'_1C_2$, where $C'_1$ is the context non-terminal
generating $w_{C_1}|_{\mathrm{iPos}(w_{C_1},k)}$ and $C'$ is an additional new non-terminal. Since
$\mathrm{Vdepth}_G(C_1) < \mathrm{Vdepth}_G(C)$, by induction hypothesis, it holds that for all the
new non-terminals $N'$ in $G' = $ kExt($G, C_1, k$), $\mathrm{Vdepth}_{G'}(N') \leq \mathrm{Vdepth}_G(C_1)$
and at most $\mathrm{Vdepth}_G(C_1)$ new non-terminals have been added. It follows
that at most $\mathrm{Vdepth}_G(C)$ new non-terminals have been added in the
construction of kExt($G, C, k$), and $\mathrm{Vdepth}_{G'}(C'_1) \leq \mathrm{Vdepth}_G(C_1)$. Moreover,
since $\mathrm{Vdepth}_G(C) = 1 + \max(\mathrm{Vdepth}_G(C_1), \mathrm{Vdepth}_G(C_2))$, $\mathrm{Vdepth}_{G'}(C') =
1 + \max(\mathrm{Vdepth}_{G'}(C'_1), \mathrm{Vdepth}_{G'}(C_2))$ and $\mathrm{Vdepth}_G(C_2) = \mathrm{Vdepth}_{G'}(C_2)$, it also
holds that $\mathrm{Vdepth}_{G'}(C') \leq \mathrm{Vdepth}_G(C)$.

- Assume that $(C \to C_1C_2) \in G$ and $k' = |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}| < k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}| + |w_{C_2}|$.
In this case, kExt($G, C, k$) $=$ kExt($G, C_2, k - k'$) and, since $\mathrm{Vdepth}_G(C_2) <
\mathrm{Vdepth}_G(C)$, the lemma directly follows by induction hypothesis.

Finally, note that the case $(C \to C_1C_2) \in G$ and $|w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}| + |w_{C_2}| < k$ is not
possible due to the assumption that $w_C|_{\mathrm{iPos}(w_C,k)}$ is a context. □

Lemma 4.10. Let $G$ be an STG, let $N$ be a non-terminal of $G$, let $V$ be a
$\lambda$-set for $G$, let $k$ be a natural number satisfying $k \leq |\mathrm{pre}(w_{G,N})|$, and let $G'$ be
kExt($G, N, k$).

Then, for every non-terminal $N'$ of $G$ it holds that $\mathrm{Vdepth}_G(N') =
\mathrm{Vdepth}_{G'}(N')$, and for every new non-terminal $N''$ in $G'$ and not in $G$, it holds that
$\mathrm{Vdepth}_{G'}(N'') \leq \mathrm{Vdepth}(G)$. Moreover, the number of new added non-terminals is
bounded by $\mathrm{Vdepth}(G)$.

Proof. The identity $\mathrm{Vdepth}_G(N') = \mathrm{Vdepth}_{G'}(N')$ for each non-terminal $N'$
of $G$ is straightforward from the fact that kExt($G, N, k$) does not change the rules
for the non-terminals occurring in $G$. We will prove the fact that $\mathrm{Vdepth}_{G'}(N'') \leq
\mathrm{Vdepth}(G)$ for each new non-terminal $N''$ in $G'$ and not in $G$, plus the fact that at
most $\mathrm{Vdepth}(G)$ new non-terminals have been added, by induction on $\mathrm{depth}_G(N)$.
The base case ($\mathrm{depth}(N) = 1$) trivially holds since, in this case, the STG $G$ is not
modified. For the induction step we distinguish cases according to the definition
of kExt($G, N, k$). The only interesting cases are when $(N \to C_1A_2) \in G$ and
$k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}|$, and when $(N \to C_1C_2) \in G$ and $k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}|$. Note that these
are the only cases in which the grammar might be extended with new non-terminals
after the recursive call. We will solve the first one; the other is solved analogously.

Hence, assume that $(N \to C_1A_2) \in G$ and $k \leq |w_{\mathrm{Pre}_G,\mathcal{L}_{C_1}}|$. In this
case the non-terminal $N'$ in kExt($G, C_1, k$) generating the subterm of $w_{G,C_1}$
at position $\mathrm{iPos}(w_{G,C_1}, k)$ is either a term non-terminal or a context
non-terminal. We will solve the two cases separately. First assume that $N'$ is a
term non-terminal. In this case kExt($G, N, k$) is constructed as kExt($G, C_1, k$).
Since $\mathrm{Vdepth}_G(C_1) < \mathrm{Vdepth}_G(N)$, the lemma holds by induction hypothesis
in this case. On the other hand, if $N'$ is a context non-terminal, the
construction of kExt($G, N, k$) is done by computing kExt($G, C_1, k$) and adding
the rule $A \to N'A_2$, where $A$ is an additional new term non-terminal. By
Lemma 4.9, for all the new non-terminals $N''$ in kExt($G, C_1, k$) and not in
$G$, $\mathrm{Vdepth}_{G'}(N'') \leq \mathrm{Vdepth}_G(C_1)$. Moreover, the number of new added
non-terminals is bounded by $\mathrm{Vdepth}_G(C_1)$. Hence, $\mathrm{Vdepth}_{G'}(N') \leq \mathrm{Vdepth}_G(C_1)$
and, since $\mathrm{Vdepth}_G(C_1) < \mathrm{Vdepth}_G(N)$, at most $\mathrm{Vdepth}_G(N) \leq \mathrm{Vdepth}(G)$ new
non-terminals have been added in the construction of kExt($G, N, k$). Furthermore,
since $\mathrm{Vdepth}_G(N) = 1 + \max(\mathrm{Vdepth}_G(C_1), \mathrm{Vdepth}_G(A_2))$, $\mathrm{Vdepth}_{G'}(A) =
1 + \max(\mathrm{Vdepth}_{G'}(N'), \mathrm{Vdepth}_{G'}(A_2))$ and $\mathrm{Vdepth}_G(A_2) = \mathrm{Vdepth}_{G'}(A_2)$, it also
holds that $\mathrm{Vdepth}_{G'}(A) \leq \mathrm{Vdepth}_G(N) \leq \mathrm{Vdepth}(G)$. □

5. NP-COMPLETENESS OF CONTEXT MATCHING WITH STGS 

As a complement to Theorem 3.19, we are now ready to show that context matching
with STG-compressed terms is NP-complete. NP-hardness with STGs follows
from NP-hardness of the same problem without any compression (see
[Schmidt-Schauß and Schulz 1998]). Hence, we just have to prove that this problem is in
NP. Our goal is to be able to guess a solution of polynomial size for a given input
context matching problem, and to check it efficiently. To this end, we first introduce
definitions of prefix and suffix of a context and subcontext of a term as extensions
of the original STG, and argue that the size of such extensions is polynomially
bounded by the size of the original STG. Some of the ideas used are borrowed from
[Levy et al. 2006b; 2004], but adapted to show a concrete complexity measure.

5.1 Grammar-extensions for hole path and subcontexts 

Definition 5.1. Let $G$ be an STG. We define the SCFG $H_G$ representing the
hole paths of $w_C$ for all context non-terminals $C$ as follows. For each context
non-terminal $C$ of $G$ we construct a non-terminal $H_C$ of $H_G$. For each natural number $i$
between 1 and the maximum arity of the signature $\Sigma$, we construct a non-terminal
$H_i$. For each rule with a context non-terminal $C$ as left-hand side, we construct
one rule of $H_G$, depending on the form of the rule of $C$ in $G$, as follows.

- if $(C \to [\cdot]) \in G$, then $H_G$ contains the rule $H_C \to \lambda$.

- if $(C \to C_1C_2) \in G$, then $H_G$ contains the rule $H_C \to H_{C_1}H_{C_2}$.

- if $(C \to f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m)) \in G$, then $H_G$ contains the rule
$H_C \to H_iH_{C_i}$.

Moreover, for each $H_i$ we construct the rule $H_i \to i$.

Lemma 5.2. The SCFG $H_G$ can be computed for an STG $G$ in time $O(|G|)$.
For every context non-terminal $C \in G$, the corresponding non-terminal $H_C \in H_G$
generates $\mathrm{hp}(w_C)$. Moreover, $|H_G| \leq |G| + M$ and $\mathrm{depth}(H_C) \leq \mathrm{depth}(C)$ for all
$C$, where $M$ is the maximum arity of the signature.

Proof. It is easy to prove that $w_{H_C} = \mathrm{hp}(w_C)$ as well as $\mathrm{depth}(H_C) \leq \mathrm{depth}(C)$
using induction on $\mathrm{depth}(C)$. Moreover, from every rule of $G$ we produce one rule
of $H_G$, and for every $i$ between 1 and $M$ we produce one rule of $H_G$, which leads
to a linear time algorithm with respect to $|G|$. □
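
The construction of $H_G$ can be written down almost verbatim. The sketch below uses a hypothetical tuple encoding of context rules and, as a simplification, creates $H_i$ only for the argument positions that actually occur rather than for every $i$ up to the maximum arity; `hole_path` expands a non-terminal explicitly and is meant for small examples only.

```python
def hole_path_scfg(context_rules):
    """Build the rules of H_G from the context rules of an STG.

    Assumed encoding of context rules:
      ("hole",)                      C -> [.]
      ("comp", C1, C2)               C -> C1 C2
      ("cfun", f, left, Ci, right)   C -> f(left..., Ci, right...)
    Non-terminals of H_G are ("H", C) and ("H", i); terminals are integers.
    """
    h = {}
    for c, rule in context_rules.items():
        if rule[0] == "hole":
            h[("H", c)] = []  # H_C -> lambda
        elif rule[0] == "comp":
            h[("H", c)] = [("H", rule[1]), ("H", rule[2])]
        else:
            _, _f, left, inner, _right = rule
            i = len(left) + 1  # argument position of the hole
            h[("H", c)] = [("H", i), ("H", inner)]
            h[("H", i)] = [i]  # H_i -> i
    return h


def hole_path(h, symbol):
    """Expand a symbol of H_G to the hole path it generates."""
    if isinstance(symbol, int):
        return [symbol]
    return [i for s in h[symbol] for i in hole_path(h, s)]


contexts = {
    "C0": ("hole",),
    "C1": ("cfun", "g", ["A"], "C0", ["B"]),  # g(A, [.], B)
    "C2": ("comp", "C1", "C1"),
    "C3": ("comp", "C2", "C2"),
}
h = hole_path_scfg(contexts)
path = hole_path(h, ("H", "C3"))
```

Since every context rule contributes exactly one rule for its own $H_C$, the size bound $|G| + M$ of Lemma 5.2 is visible directly in the loop.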

Definition 5.3. Let $G$ be an STG describing first-order terms and contexts, let
$C$ be a context non-terminal of $G$, and let $l$ be a natural number such that $l \leq
|\mathrm{hp}(w_C)|$. We define the extension Pref($G, C, l$) of $G$ representing a prefix of $w_C$
recursively as follows.

- If $l = 0$, then Pref($G, C, l$) contains $G$ plus the rule $C' \to [\cdot]$, where $C'$ is a new
context non-terminal. In the next cases we assume $l > 0$.

- If $l = |\mathrm{hp}(w_C)|$, then Pref($G, C, l$) $:= G$. In the next cases we assume $l <
|\mathrm{hp}(w_C)|$.

- If $(C \to C_1C_2) \in G$ and $l > |\mathrm{hp}(w_{C_1})|$, then Pref($G, C, l$) includes
Pref($G, C_2, l - |\mathrm{hp}(w_{C_1})|$), which contains a non-terminal $C'_2$ generating the
prefix of $w_{C_2}$ with $|\mathrm{hp}(w_{C'_2})| = l - |\mathrm{hp}(w_{C_1})|$, plus the rule $C' \to C_1C'_2$, where $C'$ is
a new context non-terminal.

- If $(C \to C_1C_2) \in G$ and $l \leq |\mathrm{hp}(w_{C_1})|$, then we define Pref($G, C, l$) as
Pref($G, C_1, l$).

- If $(C \to f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m)) \in G$, then Pref($G, C, l$) includes
Pref($G, C_i, l - 1$), which contains a non-terminal $C'_i$ generating the prefix of $w_{C_i}$
with $|\mathrm{hp}(w_{C'_i})| = l - 1$, plus the rule $C' \to f(A_1, \ldots, A_{i-1}, C'_i, A_{i+1}, \ldots, A_m)$,
where $C'$ is a new context non-terminal.

Lemma 5.4. Let $G$ be an STG describing first-order terms and contexts, let $C$ be
a context non-terminal of $G$, and let $l$ be a natural number such that $l \leq |\mathrm{hp}(w_C)|$.
Then, Pref($G, C, l$) is an extension of $G$ computable in time $O(|G|)$. It adds
at most depth($C$) non-terminals such that one of them, called $C'$, generates the
prefix of $w_C$ satisfying $|\mathrm{hp}(w_{C'})| = l$. Moreover, $\mathrm{depth}(C') \leq \mathrm{depth}(C)$ and
depth(Pref($G, C, l$)) $=$ depth($G$).

Proof. The correctness of the definition of Pref($G, C, l$), as well as $\mathrm{depth}(C') \leq
\mathrm{depth}(C)$ and depth(Pref($G, C, l$)) $=$ depth($G$), can be easily shown by
induction on depth($C$). With respect to time complexity, we first precompute $|\mathrm{hp}(w_C)|$
for each context non-terminal of $G$, which can be done in linear time thanks to
Lemma 2.10 and Lemma 5.2. Time complexity $O(|G|)$ follows from the fact that
the recursive definition decreases the depth of the involved non-terminal. □
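
The prefix construction can be followed on a small grammar. The sketch below mirrors the case analysis of Pref under assumed conventions: context rules are encoded as tuples, term arguments are kept as unexpanded names, and fresh non-terminal names come from a caller-supplied iterator. The names `hp_len`, `pref` and `show` are hypothetical helpers, not part of the paper.

```python
import itertools


def hp_len(rules, c, memo):
    """Length of the hole path of the context generated by c."""
    if c not in memo:
        rule = rules[c]
        if rule[0] == "hole":
            memo[c] = 0
        elif rule[0] == "comp":
            memo[c] = hp_len(rules, rule[1], memo) + hp_len(rules, rule[2], memo)
        else:  # ("cfun", f, left_args, inner_context, right_args)
            memo[c] = 1 + hp_len(rules, rule[3], memo)
    return memo[c]


def pref(rules, c, l, fresh):
    """Return (extended rules, non-terminal generating the prefix of length l)."""
    rules, memo = dict(rules), {}

    def add(rule):
        name = next(fresh)
        rules[name] = rule
        return name

    def go(c, l):
        if l == 0:
            return add(("hole",))
        if l == hp_len(rules, c, memo):
            return c
        rule = rules[c]
        if rule[0] == "comp":
            _, c1, c2 = rule
            n1 = hp_len(rules, c1, memo)
            if l > n1:
                return add(("comp", c1, go(c2, l - n1)))
            return go(c1, l)
        _, f, left, inner, right = rule
        return add(("cfun", f, left, go(inner, l - 1), right))

    return rules, go(c, l)


def show(rules, c, inner="[.]"):
    """Print the context of c; term arguments are shown by name only."""
    rule = rules[c]
    if rule[0] == "hole":
        return inner
    if rule[0] == "comp":
        return show(rules, rule[1], show(rules, rule[2], inner))
    _, f, left, ci, right = rule
    return f + "(" + ",".join(list(left) + [show(rules, ci, inner)] + list(right)) + ")"


contexts = {
    "C0": ("hole",),
    "C1": ("cfun", "g", ["A"], "C0", ["B"]),
    "C2": ("comp", "C1", "C1"),
    "C3": ("comp", "C2", "C2"),  # hole path of length 4
}
fresh = (f"N{i}" for i in itertools.count())
extended, prefix = pref(contexts, "C3", 3, fresh)
```

Every recursive call moves to a non-terminal of strictly smaller depth and adds at most one rule, in line with the bound of depth($C$) new non-terminals in Lemma 5.4.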

Definition 5.5. Let $G$ be an STG describing first-order terms and contexts, let
$C$ be a context non-terminal of $G$, and $l$ a natural number such that $l \leq |\mathrm{hp}(w_C)|$.
We define the extension Suff($G, C, l$) of $G$ representing a suffix of $w_C$ as follows:

- If $l = 0$, then Suff($G, C, l$) $:= G$. In the next cases we assume $l > 0$.

- If $l = |\mathrm{hp}(w_C)|$ then Suff($G, C, l$) contains $G$ plus the rule $C' \to [\cdot]$, where $C'$ is
a new context non-terminal. In the next cases we assume $l < |\mathrm{hp}(w_C)|$.

- If $(C \to C_1C_2) \in G$ and $l < |\mathrm{hp}(w_{C_1})|$, then Suff($G, C, l$) includes
Suff($G, C_1, l$), which contains a context non-terminal $C'_1$ generating the suffix
of $w_{C_1}$ with $|\mathrm{hp}(w_{C'_1})| = |\mathrm{hp}(w_{C_1})| - l$, plus the rule $C' \to C'_1C_2$, where $C'$ is a
new context non-terminal.

- If $(C \to C_1C_2) \in G$ and $l \geq |\mathrm{hp}(w_{C_1})|$, then, with $l' := l - |\mathrm{hp}(w_{C_1})|$, we define
Suff($G, C, l$) as Suff($G, C_2, l'$).

- If $(C \to f(A_1, \ldots, A_{i-1}, C_i, A_{i+1}, \ldots, A_m)) \in G$, then we define Suff($G, C, l$) as
Suff($G, C_i, l - 1$).

Lemma 5.6. Let $G$ be an STG describing first-order terms and contexts. Let $C$
be a context non-terminal of $G$, and $l$ a natural number such that $l \leq |\mathrm{hp}(w_C)|$.
Then, Suff($G, C, l$) is an extension of $G$ computable in time $O(|G|)$. It adds at
most depth($C$) non-terminals such that one of them, called $C'$, generates the suffix
of $w_C$ satisfying $|\mathrm{hp}(w_{C'})| = |\mathrm{hp}(w_C)| - l$. Moreover, $\mathrm{depth}(C') \leq \mathrm{depth}(C)$ and
depth(Suff($G, C, l$)) $=$ depth($G$).

Proof. The proof is analogous to the one of Lemma 5.4. □ 

Definition 5.7. Let G be an STG generating terms and contexts, let A be a
term non-terminal of G, and let p be a position in w_A. Then, we recursively define
pCon(G, A, p) as an extension of G representing the prefix context of A with hole
path p as follows.

— If A → a(A_1, ..., A_m) ∈ G and p = i · p', then pCon(G, A, p) includes
pCon(G, A_i, p'), which contains a non-terminal C_i generating the context
prefix of w_{A_i} with hp(w_{C_i}) = p', plus the rule C' → a(A_1, ..., C_i, ..., A_m), where
C' is a new context non-terminal.

— If A → A', then pCon(G, A, p) = pCon(G, A', p).
— If A → C_1 A_2 ∈ G, then p = p_1 · p_2, where p_1 is the maximal common prefix of p
and hp(C_1). We distinguish three cases:

— If p_1 = hp(C_1), then pCon(G, A, p) includes pCon(G, A_2, p_2), which contains a
non-terminal C_2 generating the context prefix of w_{A_2} with hp(w_{C_2}) = p_2, plus
the rule C' → C_1 C_2, where C' is a new context non-terminal.

— If p_1 ≺ hp(C_1) and p_2 = λ, then pCon(G, A, p) is defined as Pref(G, C_1, |p_1|).

— If p_1 ≺ hp(C_1) and p_2 ≠ λ, then p is of the form p_1 · i · p_3 and hp(C_1) is of
the form p_1 · k · p_4, for some positions p_3 and p_4, and some integers i and k
satisfying i ≠ k. We assume i < k, without loss of generality. Let l_1 and l_4 be
|p_1| and |p_4|, respectively.

Let G_1 be Pref(G, C_1, l_1). The STG G_1 contains a context non-terminal
C_{11} generating the prefix of w_{C_1} such that |hp(C_{11})| = l_1. Let G_2 be
Suff(G_1, C_1, |hp(C_1)| − l_4). The STG G_2 contains a context non-terminal C_{12}
generating the suffix of w_{C_1} such that |hp(C_{12})| = l_4.

Let G'_1 be Suff(G, C_1, l_1). The STG G'_1 contains a context non-terminal C'_{12}
generating the suffix of w_{C_1} such that |hp(C'_{12})| = |hp(C_1)| − l_1 = l_4 + 1. Let
G'_2 be Pref(G'_1, C'_{12}, 1). The STG G'_2 contains a context non-terminal C'_{11}
generating the prefix of w_{C'_{12}} such that |hp(C'_{11})| = 1.

At this point, note that w_{G,C_1} = w_{G_2,C_{11}} w_{G'_2,C'_{11}} w_{G_2,C_{12}}. Moreover, the rule
of C'_{11} in G'_2 is of the form C'_{11} → a(A'_1, ..., A'_{k−1}, C''_{11}, A'_{k+1}, ..., A'_m), where
all the A'_j are term non-terminals of the original G, and generating the same
terms as in G. Moreover, the rule of C''_{11} in G'_2 is necessarily C''_{11} → [·].
We define pCon(G, A, p) as pCon(G_2, A'_i, p_3), which contains a context non-
terminal C_3 generating the context prefix of w_{A'_i} with hp(w_{C_3}) = p_3, plus the
rules C' → C_{11} C_4, C_4 → a(A'_1, ..., A'_{i−1}, C_3, A'_{i+1}, ..., A'_{k−1}, A'_k, A'_{k+1}, ..., A'_m),
and A'_k → C_{12} A_2, where C', C_4, A'_k are new non-terminals.

Lemma 5.8. Let G be an STG describing terms and contexts, let A be a non-
terminal of G, and let p be a position in w_A. Then, pCon(G, A, p) contains at
most depth(A) * (2 depth(A) + 3) new non-terminals such that one of them, called
C', generates the context prefix of w_A with hp(w_{C'}) = p. Moreover, depth(C') ≤
4 depth(A).

Proof. The fact that pCon(G, A, p) contains a context non-terminal C' gener-
ating the context prefix of w_A with hp(w_{C'}) = p can be verified by induction on
depth(A) and distinguishing cases according to the definition of pCon(G, A, p). As
in previous constructions, pCon(G, A, p) can be computed in a single run over the
rules of G. To show the upper bound on the size of the computed extension, it
suffices to note that the worst case in this sense is when the rule of A is of the form
A → C_1 A_2 and hp(C_1) and p are disjoint. In such a case, we add 3 new rules plus
the new rules in Pref(G, C_1, |p_1|) and Suff(G, C_1, |p_1| + 1), where p_1 is the maxi-
mal common prefix between hp(C_1) and p. The number of added non-terminals is
bounded by depth(A) for both the Pref and the Suff constructions by Lemma 5.4
and Lemma 5.6, respectively. The fact that depth(C') ≤ 4 depth(A) can be verified by
induction on depth(A), using Lemma 5.4 and Lemma 5.6. □
5.2 NP-completeness of STG-context-matching

At this point we are ready to show that the STG-context-matching problem is in
NP. However, we will first remark on how we represent the input and the solutions
for this problem. An input consists of an STG G and two non-terminals A_s and A_t
of G. We want to decide whether there exists a substitution σ for the first-order
and context variables occurring in w_{A_s} such that σ(w_{A_s}) = w_{A_t}. In the input of the
algorithm, the first-order and the context variables are 0-ary and 1-ary terminals
of G, respectively. A solution σ can be represented by another STG G', where
the first-order and the context variables are term and context non-terminals of G',
respectively. That is, σ(x) = w_{G',x} and σ(F) = w_{G',F}, for each first-order variable
x and context variable F. For proving NP inclusion, we just show that, if such a σ
exists, then there exists an extension G' of G, which is polynomially bounded in the
size of G, satisfying w_{G',A_s} = w_{G',A_t} = w_{G,A_t}. The fact that this equality can be
checked in polynomial time follows again from Theorem 2.9.

Lemma 5.9. Let G be an STG, and let A_s and A_t be term non-terminals of G.
Let (A_s, A_t, G) be an STG-context-matching problem instance, and let σ be a sub-
stitution such that σ(w_{G,A_s}) = w_{G,A_t} (a solution). Then, there exists an extension
G' of G such that w_{G',A_s} = w_{G',A_t} = w_{G,A_t}. Furthermore, |G'| is polynomially
bounded by |G|.

Proof. Let {x_1, ..., x_n} and {F_1, ..., F_m} be the sets of first-order variables and
context variables, respectively, occurring in w_{G,A_s}. For each first-order variable x_i,
σ(x_i) is a subterm of w_{A_t} at some position p_i. Thus, for each first-order variable x_i,
we construct the STG G'_{x_i} = (TN'_{x_i}, CN'_{x_i}, Σ'_{x_i}, R'_{x_i}) as kExt(A_t, G, pIndex(t, p_i)),
which contains a term non-terminal A_{x_i} generating w_{A_{x_i}} = σ(x_i). Then we convert
x_i into a non-terminal generating σ(x_i) by defining G_{x_i} = (TN_{x_i}, CN_{x_i}, Σ_{x_i}, R_{x_i})
from G'_{x_i} as G_{x_i} = (TN'_{x_i} ∪ {x_i}, CN'_{x_i}, Σ'_{x_i} − {x_i}, R'_{x_i} ∪ {x_i → A_{x_i}}). Similarly, for
each context variable F_j, σ(F_j) = C is a prefix context of some subterm of t = w_{A_t}.
Therefore, there exist positions q_j, q'_j such that C is the prefix context of t|_{q_j}
with the hole at position q'_j. Thus, for each context variable F_j, we construct
G'_{F_j} = (TN'_{F_j}, CN'_{F_j}, Σ'_{F_j}, R'_{F_j}) as pCon(kExt(A_t, G, pIndex(t, q_j)), A_{F_j}, q'_j), where
kExt(A_t, G, pIndex(t, q_j)) contains a term non-terminal A_{F_j} generating t|_{q_j}, and
G'_{F_j} contains a context non-terminal C_{F_j} generating σ(F_j). Then we convert F_j
into a context non-terminal by defining G_{F_j} = (TN_{F_j}, CN_{F_j}, Σ_{F_j}, R_{F_j}) from G'_{F_j}
as G_{F_j} = (TN'_{F_j}, CN'_{F_j} ∪ {F_j}, Σ'_{F_j} − {F_j}, R'_{F_j} ∪ {F_j → C_{F_j}}). Note that each
extension of G that instantiates a certain variable is independent from the others,
since all of them ask for subterms/subcontexts of w_{A_t}, which is ground and does
not change after substituting a variable. Hence, each w_{A_{x_i}} and each w_{C_{F_j}} can
be defined independently from the rest using the STG G given as input. In fact,
without loss of generality, we can assume that the newly added non-terminals for
each G_{x_i} and each G_{F_j} are disjoint. Thus, we construct G' as

\[
G' = \Bigl( \bigcup_{i=1}^{n} TN_{x_i} \cup \bigcup_{j=1}^{m} TN_{F_j},\;
            \bigcup_{i=1}^{n} CN_{x_i} \cup \bigcup_{j=1}^{m} CN_{F_j},\;
            \bigcap_{i=1}^{n} \Sigma_{x_i} \cap \bigcap_{j=1}^{m} \Sigma_{F_j},\;
            \bigcup_{i=1}^{n} R_{x_i} \cup \bigcup_{j=1}^{m} R_{F_j} \Bigr).
\]

By Lemma 4.5, each kExt(A_t, G, pIndex(t, p_i)) and each
kExt(A_t, G, pIndex(t, q_j)) has at most depth(G) new non-terminals.
By the same Lemma, each depth(kExt(A_t, G, pIndex(t, p_i))) and each
depth(kExt(A_t, G, pIndex(t, q_j))) is bounded by depth(G). Thus, each G_{x_i}
has at most depth(G) + 1 new non-terminals, and each depth(G_{x_i}) is bounded by
depth(G) + 1. By Lemma 5.8, each pCon(kExt(A_t, G, pIndex(t, q_j)), A_{F_j}, q'_j) has
at most depth(G) * (2 depth(G) + 3) new non-terminals. Thus, each G_{F_j} has at
most depth(G) + depth(G) * (2 depth(G) + 3) + 1 new non-terminals, that is,
2 depth(G)^2 + 4 depth(G) + 1.

In order to count |G'|, by the assumption that all newly added non-terminals are
disjoint for each STG, we can just take the sum of their size increases with respect
to G. Therefore, |G'| is bounded by |G| + n(depth(G) + 1) + m(2 depth(G)^2 +
4 depth(G) + 1). □
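
As a sanity check on the counting in this proof, the per-variable bound can be expanded term by term. The abbreviation d for depth(G) is introduced only for this display and is not notation from the paper.

```latex
\begin{align*}
d + d\,(2d + 3) + 1 &= 2d^2 + 4d + 1, \\
|G'| &\le |G| + n\,(d + 1) + m\,(2d^2 + 4d + 1).
\end{align*}
```

Since d ≤ |G| and n, m are bounded by the number of terminals of G, the right-hand side is polynomial in |G|.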

Theorem 5.10. Context matching with STGs is in NP and hence it is NP- 
complete. 

Proof. Let G be an STG, and let A_s and A_t be term non-terminals of G.
In order to verify that a given extension G' of G represents a solution for the
match equation {A_s = A_t, G} it suffices to decide whether w_{G',A_s} = w_{G',A_t}, which
can be done in polynomial time w.r.t. |G'| by Theorem 2.9. By Lemma 5.9, if
{A_s = A_t, G} has a solution σ then there exists an extension of polynomial size
w.r.t. |G| representing σ. Thus there is a polynomial time verifier for the STG-
context-matching problem, and hence it belongs to NP. Since context matching is
already known to be NP-hard [Schmidt-Schauß and Schulz 1998] we obtain NP-
completeness. □

For the special case of matching of strings compressed with SCFGs we also
obtain NP-completeness. An instance of the matching problem for strings is a list of
equations s_1 = t_1, ..., s_n = t_n, where the s_i, t_i are strings and only the s_i may
contain string variables. A solution σ may replace string variables by strings, and must
solve all equations, i.e. σ(s_i) = t_i for all i.

Corollary 5.11. String-matching where left and right hand sides are com- 
pressed using an SCFG, is NP-complete. 

Proof. It is well-known that string matching is NP-hard [Benanav et al. 1985], 
and using a monadic signature, Theorem 5.10 shows the claim. □ 

6. FIRST-ORDER UNIFICATION WITH STGS 

In this section we prove that the first-order unification problem can be solved in 
polynomial time even when the input is compressed using STGs. We will use the 
algorithms and constructions in Section 4, where the polynomial running time of 
certain constructions there is now relevant. 

Definition 6.1. The first-order unification problem with STG has an STG G rep-
resenting first-order terms and contexts as input, plus two term non-terminals A_s
and A_t of G representing terms s = w_{G,A_s} and t = w_{G,A_t}. Its decisional version
asks whether s and t are unifiable. In the affirmative case, its computational version
asks for a representation of the most general unifier.

Our algorithm generates the most general unifier in polynomial time and repre- 
sents it again with an STG. It will make heavy use of the grammar-constructions 
in Section 4. 
Input: An STG G and term non-terminals A_s and A_t
       (we write s and t for w_{A_s} and w_{A_t}).
While s and t are different do:
    Look for the first position k such that pre(s)[k] ≠ pre(t)[k].
    If both pre(s)[k] and pre(t)[k] are function symbols Then
        Halt stating that the initial s and t are not unifiable
    // Here, either pre(s)[k] or pre(t)[k], say pre(s)[k], is a variable x.
    If x occurs in t|_p, where p = iPos(t, k), Then
        Halt stating that the initial s and t are not unifiable
    Extend G by the assignment {x → t|_p}
EndWhile
Halt stating that the initial s and t are unifiable

Fig. 6. Unification Algorithm of STG-Compressed Terms

6.1 Outline of the algorithm 

Our unification algorithm for compressed terms in Fig. 6 is a variant of Robinson's
algorithm [Robinson 1965]: Given an STG G as a compressed representation of two
terms s and t, we compute the smallest index k at which pre(s) and pre(t) differ.
At this point, if both pre(s)[k] and pre(t)[k] are function symbols, we terminate
stating non-unifiability. Otherwise, either pre(s) or pre(t), say pre(s), contains a
variable x at k. Note that, since the arity of the terminals in G is fixed, the index
k corresponds to a unique position p ∈ Pos(s) ∩ Pos(t), as explained in Section 2.
If x properly occurs in the subterm of t at p, then we terminate, again stating
non-unifiability. Otherwise, we replace x by the subterm of t at p everywhere,
and repeat the process until both s and t become equal, in which case we state
unifiability.
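
The loop just outlined can be sketched in Python on plain, uncompressed terms. This is an illustration of the control flow only, not the STG-based algorithm of this paper: the term encoding (a variable is a str, an application is a tuple (symbol, arg_1, ..., arg_m) with fixed arity per symbol) and the names preorder, head, substitute and unify are assumptions made for this sketch.

```python
def preorder(term):
    """Yield the subterms of `term` in preorder; entry k is rooted at index k."""
    yield term
    if not isinstance(term, str):
        for arg in term[1:]:
            yield from preorder(arg)


def head(term):
    """Root symbol, tagged so that a variable never equals a function symbol."""
    return ("var", term) if isinstance(term, str) else ("fun", term[0])


def substitute(term, var, repl):
    """Replace every occurrence of variable `var` in `term` by `repl`."""
    if isinstance(term, str):
        return repl if term == var else term
    return (term[0],) + tuple(substitute(a, var, repl) for a in term[1:])


def unify(s, t):
    """Return the bindings made, in order, or None if s and t do not unify.

    Assumes every function symbol is used with one fixed arity, so equal
    preorder prefixes imply that index k denotes the same position in s and t.
    """
    bindings = []
    while s != t:
        # First index k with pre(s)[k] != pre(t)[k], as a pair of subterms.
        u, v = next((a, b) for a, b in zip(preorder(s), preorder(t))
                    if head(a) != head(b))
        if not isinstance(u, str) and not isinstance(v, str):
            return None                      # clash of two function symbols
        if not isinstance(u, str):
            u, v = v, u                      # make u the variable
        if any(w == u for w in preorder(v)):
            return None                      # occurs check fails
        bindings.append((u, v))
        s, t = substitute(s, u, v), substitute(t, u, v)
    return bindings
```

Applying the returned bindings in order yields a most general unifier in triangular form. On explicit terms the repeated substitution can blow up exponentially, which is the problem the STG representation is meant to avoid.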

6.2 A polynomial time algorithm for first-order unification with STGs 

From a high level perspective the structure of our algorithm described in Subsec- 
tion 6.1 is very simple and rather standard: it is very much like the Robinson 
unification algorithm [Robinson 1965]. Many algorithms for first-order unifica- 
tion are variants of this scheme. They represent the terms with directed acyclic 
graphs (dags), implemented somehow, in order to avoid the space explosion due 
to the repeated instantiation of variables by terms. For example, the Martelli- 
Montanari-algorithm represents instantiations by equations [Martelli and Monta- 
nari 1982; Baader and Snyder 2001b]. In our setting, those terms are represented 
by STGs. In fact, the input is an STG G, and two term non-terminals A_s and A_t
representing s and t, respectively. In previous sections we showed how to efficiently
perform all the required operations on STGs: decide whether s and t are equal,
generate a compressed representation for pre(s) and pre(t), look for the smallest in-
dex k such that pre(s)[k] ≠ pre(t)[k], construct the term t|_p, where p = iPos(t, k),
and instantiate the variable x = s|_p by t|_p.

The algorithm runs in polynomial time due to the following observations. Let n
and m be the initial value of depth(G) and |G|, respectively. We define V to be the
set of all the first-order variables at the start of the execution (before any of them
has been converted into a non-terminal). Hence, at this point Vdepth(G) = n. The
value Vdepth(G) is preserved to be n along the execution of the algorithm thanks
to Lemmas 4.8 and 4.10. Moreover, by Lemma 4.10, at most n new non-terminals
are added at each step. Since at most |V| steps are executed, the final size of G is
bounded by m + |V|n. Each execution step takes time at most O(|G|^3). Thus we
have proved:

Theorem 6.2. First-order unification of two terms represented by an STG can
be done in polynomial time O(|V|(m + |V|n)^3), where m represents the size of the
input STG, n represents its depth, and V represents the set of different first-order
variables occurring in the input terms. This holds for the decision question, as
well as for the computation of the most general unifier, whose components are
represented by the final STG.

7. FIRST-ORDER MATCHING WITH STGS 

In this section we prove that the first-order matching problem can be solved in 
polynomial time even when the input is compressed using STGs. 

Definition 7.1. The first-order matching problem with STG has an STG G rep-
resenting first-order terms and contexts as input, plus two term non-terminals A_s
and A_t of G representing terms s = w_{G,A_s} and t = w_{G,A_t}, where t is ground. Its
decisional version asks for the existence of a substitution σ such that σ(s) = t,
whereas its computational version asks for a representation of σ.

First-order matching is a particular case of first-order unification. However, tak- 
ing advantage of the fact that one of the terms is ground leads to a faster algorithm 
with respect to the one presented in the previous section. We also improve previous 
complexity results for this problem [Gascon et al. 2008].

7.1 Outline of the algorithm 

The structure of our algorithm is sketched in Figure 7. As commented above, the
input of the problem consists of an STG G as a compressed representation of two
terms s and t. As in the first-order unification case, the algorithm works with rep-
resentations of the preorder traversal words of the terms s and t to be matched.
Hence, we first compute a representation of pre(s) and pre(t). Then we find the
index k of the first occurrence of a variable x in pre(s), and, given G and k, com-
pute t' = t|_{iPos(t,k)}. If t' is undefined we halt giving a negative answer. Otherwise
we apply the substitution {x → t'} to s and restart the process until all variables are
replaced. Finally, let s' be the term obtained from s after all replacements are done.
We check whether s' and t are syntactically equal and answer accordingly. Note
that, in contrast to the unification algorithm, we look for the first occurrence of a
variable in pre(s) instead of looking for the first difference between pre(s) and pre(t).
This refines the approach used in the previous section for the general case of
first-order unification and improves the time complexity results of previous work
on first-order matching with STGs [Gascon et al. 2008].

In the previous section we already showed how to compute a succinct represen-
tation of pre(s) and pre(t), to compute, given a natural number k, the subterm of
a term t at position iPos(t, k), and to apply a substitution. Hence, it only remains
to show how to compute k, the index of the first occurrence of a variable in pre(s).
Input: An STG G and term non-terminals A_s and A_t
       (we write s and t for w_{A_s} and w_{A_t},
        and X for the set of variables in s).
Repeat |X| times:
    Look for the smallest index k such that pre(s)[k] = x ∈ X.
    If iPos(t, k) is undefined Then Halt stating that the initial s and t do not match.
    Extend G by the assignment {x → t|_p}, where p = iPos(t, k).
EndRepeat
If s = t Then Halt stating that the initial s and t match.
Else Halt stating that the initial s and t do not match.

Fig. 7. Matching Algorithm for STG-Compressed Terms
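
The loop of Fig. 7 can be sketched in Python on plain, uncompressed terms. As an illustration only, and not the STG-based procedure itself, the sketch assumes a variable is a str, an application is a tuple (symbol, arg_1, ..., arg_m) with fixed arity per symbol, and t is ground; the names subterms_preorder, substitute and match are hypothetical.

```python
def subterms_preorder(term):
    """List the subterms of `term` in preorder; entry k is rooted at index k."""
    out = [term]
    if not isinstance(term, str):
        for arg in term[1:]:
            out.extend(subterms_preorder(arg))
    return out


def substitute(term, var, repl):
    """Replace every occurrence of variable `var` in `term` by `repl`."""
    if isinstance(term, str):
        return repl if term == var else term
    return (term[0],) + tuple(substitute(a, var, repl) for a in term[1:])


def match(s, t):
    """Return a dict sigma with sigma(s) == t, or None if s does not match t.

    `t` is assumed to be ground, so each round removes one variable from s.
    """
    sigma = {}
    subterms_t = subterms_preorder(t)
    while True:
        subterms_s = subterms_preorder(s)
        # Smallest index k such that pre(s)[k] is a variable.
        k = next((i for i, u in enumerate(subterms_s) if isinstance(u, str)), None)
        if k is None:
            break                        # all variables have been replaced
        if k >= len(subterms_t):
            return None                  # iPos(t, k) is undefined
        x = subterms_s[k]
        sigma[x] = subterms_t[k]
        s = substitute(s, x, subterms_t[k])
    return sigma if s == t else None
```

No comparison with t is made inside the loop; a wrong guess for a binding is only detected by the final equality test, mirroring the single equality check at the end of Fig. 7.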



7.2 Finding the first occurrence of a variable 

The task of finding the index of the first occurrence of a variable in a compressed 
word can be performed efficiently as stated in the following Lemma. 

Lemma 7.2. Let P be an SCFG, and let p be a non-terminal of P representing
the preorder traversal word of a first-order term. Then, the smallest index k such
that w_p[k] is a terminal and a variable can be computed in time O(|P|).

Proof. Let X denote the set of first-order variables. We define k = index(p, P)
as follows:

\[
\mathrm{index}(p, P) =
\begin{cases}
1 & \text{if } (p \to a) \in P \wedge a \in X, \\
\mathrm{index}(p_1, P) & \text{if } (p \to p_1 p_2) \in P \wedge \exists x \in X : x \text{ occurs in } w_{P,p_1}, \\
|w_{P,p_1}| + \mathrm{index}(p_2, P) & \text{otherwise.}
\end{cases}
\]

Note that we assumed that P is in Chomsky Normal Form. If this is not the
case, we can force this assumption with a linear time and space transformation. The
fact that index(p, P) computes the smallest index k such that w_p[k] is a variable
can be shown by induction on depth(p). With respect to the time complexity, for
each non-terminal p of an SCFG P, both the number |w_p| and whether w_p contains
a variable can be precomputed in linear time as stated in Lemmas 2.10 and 2.11,
respectively. When these pre-computations are done, index(p, P) can be computed
by a single run over the rules of P and hence, it runs also in linear time. □
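
The recursive definition of index(p, P) translates directly into code. The following Python sketch assumes a hypothetical encoding of an SCFG in Chomsky Normal Form as a dict mapping each non-terminal either to a terminal symbol (a str) or to a pair of non-terminals; the function name first_variable_index and its parameters are illustrative and not taken from the paper.

```python
def first_variable_index(start, rules, variables):
    """Smallest 1-based index k such that w_start[k] is a variable, or None.

    rules:     dict non-terminal -> terminal symbol (str) or pair (p1, p2)
    variables: set of terminal symbols that are first-order variables
    """
    length = {}   # |w_p| for each reachable non-terminal p
    has_var = {}  # whether w_p contains a variable

    def prepare(p):
        # Memoised precomputation: each non-terminal is processed once.
        if p in length:
            return
        rhs = rules[p]
        if isinstance(rhs, tuple):
            left, right = rhs
            prepare(left)
            prepare(right)
            length[p] = length[left] + length[right]
            has_var[p] = has_var[left] or has_var[right]
        else:
            length[p] = 1
            has_var[p] = rhs in variables

    prepare(start)
    if not has_var[start]:
        return None

    # Descend along the unique path to the leftmost variable.
    k, p = 0, start
    while isinstance(rules[p], tuple):
        left, right = rules[p]
        if has_var[left]:
            p = left
        else:
            k += length[left]
            p = right
    return k + 1


# A word of length 2**40 + 1 described by 43 rules: a^(2**40) followed by x.
rules = {"A0": "a", "V": "x"}
for i in range(1, 41):
    rules[f"A{i}"] = (f"A{i - 1}", f"A{i - 1}")
rules["S"] = ("A40", "V")
```

The example grammar shows why the computation must work on the rules rather than on the generated word: the word has more than a trillion symbols, while the descent visits only as many non-terminals as the grammar is deep.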

7.3 A polynomial time algorithm for first-order matching with STGs 

The algorithm outlined in Subsection 7.1 runs in polynomial time due to
the following observations. Let n and m be the initial value of depth(G) and
|G|, respectively. We define V := X to be the set of all the first-order variables
at the start of the execution (before any of them has been converted into a non-
terminal). As in the unification case, the final size of the grammar is bounded
by m + |V|n thanks to Lemmas 4.8 and 4.10. Our algorithm iterates at most |V|
times. By Lemmas 4.5 and 7.2, each iteration takes linear time. Finally we check
equality of two words generated by an SCFG P, which takes time O(|P|^3) thanks
to Theorem 2.9. Hence, we have the following:
Theorem 7.3. First-order matching of two terms represented by an STG can
be done in polynomial time O((m + |V|n)^3), where m represents the size of the
input STG, n represents its depth, and V represents the set of different first-order
variables occurring in the input terms. This holds for the decision question, as
well as for the computation of the unifier, whose components are represented by the
final STG.

8. CONCLUSION AND FURTHER WORK 

We analyzed the complexity of context matching under different representations of 
terms like dags and STGs. Regarding the term compression using STGs, we showed 
that the context matching problem with STGs is NP-complete. Furthermore, we 
presented instantiation-based algorithms for the first-order matching problem and
the first-order unification problem, which can be executed directly on the com-
pressed representation of large terms and run in time polynomial in the size of
the representation. It would be interesting to investigate optimizations for these
algorithms, as well as to find an improved upper bound. We also believe that it
would be natural to consider the context matching problem using an STG encoding
for terms under certain restrictions like fixing the number of context variables. In
this sense we believe that our techniques could be useful to show that the one-con-
text unification problem is in NP when the input is represented by an STG. This
problem has been solved for plain terms as input in [Gascon et al. 2008].

For the dag representation we found a polynomial context matching algorithm 
for the case where the number of context variables is fixed. Since the problem of 
context matching is NP-complete, this result is interesting because it closely links 
a complexity jump to a specific restriction on the original problem. 

From a more general point of view, we believe that it would be interesting to
investigate other variations of context unification and matching problems. Modifi-
cations such as allowing several holes in a context add expressiveness to the problem
in order to encode complex questions about terms. It is also important to recon-
sider the complexity of sets of solutions for these variations when using different
representations of terms.